Electronic Theses and Dissertations

Study of deep neural networks on graph data in a generative learning regime

Date

2022-08-04

Author

Jiang, Chao

Type of Degree

PhD Dissertation

Department

Computer Science and Software Engineering

Abstract

Graph-structured data is ubiquitous across domains, from social networks and academic citation networks to drug-target interactions. Graph neural networks (GNNs) have achieved outstanding performance on tasks such as node classification, link prediction, and node clustering. However, researchers commonly face two questions.

First, how can we obtain a considerable amount of high-quality labeled data? Data quality is crucial for training deep neural network models, yet most current work in this area focuses on improving a model's performance under the assumption that the preprocessed data are clean. Our first result improves data quality by removing noisy information. We build a real knowledge graph from the LitCovid and PubTator datasets. The multiple types of biomedical associations in this knowledge graph, including the COVID-19-related ones, are based on co-occurring biomedical entities retrieved from recent literature. However, applications derived from these raw graphs (e.g., association prediction among genes, drugs, and diseases) have a high probability of false-positive predictions, because co-occurrence in the literature does not always indicate a true biomedical association between two entities. We propose a framework that uses generative deep neural networks to produce a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and CELL, were adopted for edge classification (i.e., link prediction), leveraging unlabeled link information in the real knowledge graph. Even in the extreme case of a 1:9 ratio of training data to test data, the proposed method achieved favorable link-prediction results (AUC-ROC > 0.8 on the synthetic dataset and > 0.7 on the real dataset) despite the limited amount of training data available.

Second, what is the decision-making process of a GNN, which often remains a black box? In addition, many of these models are vulnerable to adversarial attacks. Our second result studies the robustness of GNNs. Recent studies revealed that GNNs are vulnerable to adversarial attacks, in which feeding a GNN poisoned data at training time can severely degrade its test accuracy. However, prior studies mainly assume that adversaries can freely access and manipulate the original graph, while obtaining such access can be too costly in practice. To fill this gap, we propose a novel attack paradigm, named Generative Adversarial Fake Node Camouflaging (GAFNC), whose crux lies in crafting a set of fake nodes in a generative-adversarial regime. These nodes carry camouflaged malicious features and can poison the victim GNN by passing harmful messages to the original graph via learned topological structures. These messages can maximize the degradation of classification accuracy (i.e., a global attack) or force the victim GNN to misclassify a targeted node set into prescribed classes (i.e., a targeted attack).
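To make the evaluation protocol from the first result concrete, the following Python sketch scores held-out edges under the same extreme 1:9 train/test split and reports AUC-ROC. This is an illustration only, not the dissertation's pipeline: the toy graph, the negative-sampling scheme, and the common-neighbor score that stands in for NetGAN/CELL's learned edge probabilities are all hypothetical placeholders.

```python
# Link-prediction evaluation under a 1:9 train/test edge split (illustrative).
# A common-neighbor score stands in for the generative model's edge probabilities.
import numpy as np
import networkx as nx
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-in for the co-occurrence knowledge graph.
g = nx.barabasi_albert_graph(300, 3, seed=0)
edges = np.array(list(g.edges()))

# Extreme split from the abstract: 10% of edges for training, 90% for testing.
rng.shuffle(edges)
n_train = int(0.1 * len(edges))
train_edges, test_edges = edges[:n_train], edges[n_train:]

train_g = nx.Graph()
train_g.add_nodes_from(g.nodes())
train_g.add_edges_from(train_edges)

# Sample an equal number of non-edges as negative test examples.
non_edges = []
while len(non_edges) < len(test_edges):
    u, v = rng.integers(0, g.number_of_nodes(), size=2)
    if u != v and not g.has_edge(u, v):
        non_edges.append((u, v))

def score(u, v):
    # Placeholder for the generative model's learned edge probability:
    # here, the number of common neighbors in the training graph.
    return len(set(train_g[u]) & set(train_g[v]))

y_true = [1] * len(test_edges) + [0] * len(non_edges)
y_score = [score(u, v) for u, v in list(map(tuple, test_edges)) + non_edges]
print("AUC-ROC:", roc_auc_score(y_true, y_score))
```

Note that the score function consults only the training graph, so the held-out 90% of edges never leak into the predictor; swapping in a trained generative model's edge-probability matrix would preserve the same protocol.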
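The fake-node poisoning idea behind the second result can likewise be sketched. The code below is a minimal, hypothetical illustration rather than the GAFNC implementation: it appends k fake nodes to a toy graph, fixes their connections to real nodes, and optimizes their features by gradient ascent on a frozen victim GCN's training loss (the global-attack objective). GAFNC additionally learns the attack topology and camouflages the features; the victim weights, graph, and hyperparameters here are placeholders.

```python
# Fake-node injection sketch (global attack): maximize the victim GCN's loss
# by optimizing only the features of injected nodes. Not the GAFNC code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, c, k = 50, 16, 3, 5        # real nodes, feature dim, classes, fake nodes

# Toy graph: random symmetric adjacency, random features/labels (placeholders).
A = (torch.rand(n, n) < 0.05).float()
A = ((A + A.t()) > 0).float()
X = torch.randn(n, d)
y = torch.randint(0, c, (n,))

# Frozen, untrained victim weights (placeholder for a trained model).
W1, W2 = torch.randn(d, 32), torch.randn(32, c)

def gcn(A_full, X_full):
    # Two-layer GCN with symmetric normalization A_hat = D^-1/2 (A+I) D^-1/2.
    A_hat = A_full + torch.eye(A_full.size(0))
    d_inv = torch.diag(A_hat.sum(1).pow(-0.5))
    A_norm = d_inv @ A_hat @ d_inv
    return A_norm @ torch.relu(A_norm @ X_full @ W1) @ W2

# Fake-node features are the attack variables; fake-to-real edges are fixed
# here for simplicity (GAFNC learns this topology as well).
X_fake = torch.randn(k, d, requires_grad=True)
B = (torch.rand(k, n) < 0.1).float()
A_full = torch.zeros(n + k, n + k)
A_full[:n, :n] = A
A_full[n:, :n] = B
A_full[:n, n:] = B.t()

opt = torch.optim.Adam([X_fake], lr=0.1)
for step in range(100):
    logits = gcn(A_full, torch.cat([X, X_fake], dim=0))[:n]
    loss = -F.cross_entropy(logits, y)   # ascend the victim's training loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print("victim loss after attack:", -loss.item())
```

A targeted variant would replace the ascent objective with a descent on cross-entropy between the targeted nodes' logits and the attacker's prescribed classes, matching the global/targeted distinction drawn in the abstract.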