This Is Auburn: Electronic Theses and Dissertations

Machine Learning Meta-analysis of Proteolytic Cleavage Specificity

Date

2024-07-30

Author

Kim, Suhyeon

Type of Degree

Master's Thesis

Department

Chemical Engineering

Abstract

Proteolytic enzymes, such as cathepsins and matrix metalloproteinases (MMPs), play crucial roles in physiological processes including metabolism, cell signaling, and apoptosis. Identifying their cleavage sites is difficult because these enzymes have diverse substrate specificities and regulatory mechanisms. This thesis investigates the use of machine learning models, particularly Support Vector Machines (SVMs) and One-Class Support Vector Machines (OCSVMs), to predict proteolytic cleavage specificity. The study introduces a novel approach that uses Fourier Transform-based encoding of peptide sequences to capture essential biochemical properties and structural characteristics, which serve as inputs to the SVM algorithms. The research includes a meta-analysis that uses SVM-based feature selection to compare and contrast the substrate specificity of different proteases. This analysis aims to uncover distinct patterns in substrate interaction, offering insights for therapeutic strategies and biomarker discovery. The datasets were sourced from the MEROPS database and included both positive data points (cleaved sequences) and synthetic negative data points (non-cleaved sequences) to ensure robustness and diversity. With cross-validation and hyperparameter optimization, the SVM models demonstrated high predictive accuracy, achieving Area Under the Receiver Operating Characteristic Curve (AUC-ROC) scores close to 1.00 for several proteases. The study also explores the performance of OCSVM models, both with and without negative class data, revealing that tailored feature selection and weighting strategies significantly enhance model performance. The findings underscore the potential of machine learning techniques in advancing bioinformatics and protease research. The developed models not only improve the precision of proteomic analyses but also support the broader field of precision medicine by providing deeper insights into protease functions in health and disease.
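As a rough illustration of the Fourier Transform-based encoding mentioned in the abstract, the sketch below maps a peptide to a numeric signal and takes the magnitude of its discrete Fourier transform. It is not the thesis's implementation. The abstract does not say which biochemical properties, window length, or normalization were used, so the Kyte-Doolittle hydropathy scale, the 8-residue example peptide, and the function name `fourier_encode` are all assumptions made for illustration.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values per amino acid.
# ASSUMPTION: the abstract does not name the property scale; this one is
# chosen only because it is widely used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fourier_encode(peptide):
    """Return the normalized DFT magnitude spectrum of a peptide's hydropathy signal.

    Hypothetical helper: only the non-redundant half of the spectrum
    (frequencies 0 .. n // 2) is returned, since the input signal is real.
    """
    signal = [KYTE_DOOLITTLE[residue] for residue in peptide.upper()]
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        # Direct evaluation of the k-th DFT coefficient.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff) / n)
    return magnitudes


if __name__ == "__main__":
    # Hypothetical 8-residue window around a candidate cleavage site.
    features = fourier_encode("PLGLAGRS")
    print([round(value, 3) for value in features])
```

A feature vector of this kind could then be passed to an SVM classifier (for example scikit-learn's `SVC`, or `OneClassSVM` for the one-class setting); which library and kernel the thesis actually used is not stated in the abstract.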