This Is Auburn: Electronic Theses and Dissertations

The Effects of Genetic-based and Swarm Intelligence-based Feature Selection on Adversarial Author Identification




Halladay, Steve

Type of Degree

PhD Dissertation


Computer Science and Software Engineering


In author identification, where researchers classify writing samples by author, researchers are using larger and more diverse feature sets to try to improve classification accuracy. From a computational cost perspective, these additional feature sets become problematic. Further, adding more feature sets may inadvertently decrease classification accuracy. Therefore, selecting an appropriate subset of features is an important challenge. Feature subset selection is made more challenging by two complexities. The first is that different datasets require different feature sets for good identification performance: a feature set that performs well with one dataset may not perform well with another, so it is important to customize the feature set to the characteristics of the dataset. The second is that feature selection appears to make author identification systems more susceptible to adversarial attacks. These attacks occur when authors attempt to obfuscate their own writing style or impersonate another author’s writing style. This work focuses on the second complexity, namely, understanding how feature selection affects the susceptibility of author identification systems to adversarial attacks. Specifically, it investigates the susceptibility to adversarial attacks of author identification systems that use genetic-based and swarm intelligence-based feature selection. The intent is to observe and characterize the factors affecting adversarial susceptibility by considering several parameters, including dataset content, dataset size, and feature selection algorithm. This work employs two datasets: the CASIS dataset, which is a collection of blog posts, and the PAN19 dataset, which is a collection of extracts from Twitter feeds and includes bot-generated writing samples.
We vary the dataset sizes to ascertain the effects of a larger author pool. We also vary the bias towards minimizing the feature set. Then, we analyze the data to determine those factors that correlate with successful adversarial attacks on author identification systems both with and without feature selection.