This Is Auburn: Electronic Theses and Dissertations

Hybrid Machine Learning Techniques for Manufacturing and Beyond

Date

2021-07-19

Author

Lee, Jangwon

Type of Degree

PhD Dissertation

Department

Chemical Engineering

Abstract

This dissertation presents research to develop a novel soft sensor, feature space process monitoring, and domain knowledge-based path analysis for the manufacturing and healthcare industries. In recent years, as the Internet of Things (IoT) and data storage techniques (e.g., cloud services) have evolved, large-scale data have become available to various industries such as retail, healthcare, and manufacturing. With notable demonstrations from the world’s largest companies, such as Google, Amazon, Facebook, and Microsoft, that insights can be obtained from big data, many businesses and institutions have been using their own big data to make new inferences and solve challenging problems in data-driven ways. However, it is sometimes difficult to extract valuable information and gain insights from big data through rote application of machine learning (ML), since data collected from various sources may not be relevant and often contain noise. Without domain knowledge, the results of ML approaches can be incomplete or can even lead to misleading conclusions. Therefore, in this research I aim to demonstrate the limitations of pure data-driven ML techniques in several case studies relevant to manufacturing and healthcare, and then to address those limitations by developing solutions that systematically integrate domain knowledge with ML techniques.

In the first part of this work (Chapter 2), I introduce a novel spectroscopy-based soft sensor developed by integrating a feature engineering approach, Statistics Pattern Analysis (SPA), with a new feature selection approach, Consistency Enhanced Evolution for Variable Selection (CEEVS); the combined method is referred to as SPA-CEEVS. Based on the understanding that not all features of a spectral dataset contribute equally to the sample properties, CEEVS is proposed to identify truly relevant features that are associated with chemical functional group regions, leading to improved soft sensor performance and easier interpretation of results compared with a soft sensor based on the original spectroscopy data. SPA is embedded in the CEEVS algorithm to better capture characteristics of the spectra such as nonlinearity. SPA can also reduce the influence of spectral disturbances and background noise by extracting statistics and shape features from the spectral data. To demonstrate the effectiveness of the proposed SPA-CEEVS method, comparison studies of various variable selection methods and nonlinear models are conducted on several industrial near-infrared (NIR) spectral datasets.

In the second part of this work (Chapter 3), I propose a data-driven feature space monitoring (FSM) approach that monitors periodic operations of pressure swing adsorption (PSA) processes. In the FSM framework, features extracted from process variables, rather than the raw process variables themselves, are used to monitor the process operation. Domain knowledge of the PSA process helps to determine which features need to be generated and selected, and I suggest a way of selecting features based on this domain knowledge. In addition, the FSM-based fault detection method addresses challenges in monitoring periodic processes, such as unequal step and/or cycle times, which require trajectory alignment or synchronization in traditional statistical process monitoring (SPM) methods. In this study, a k-nearest neighbor-based FSM (FSM-kNN) is developed for fault detection. The basic idea of FSM-kNN is that the distance between a faulty cycle and its neighboring training cycles (which consist of normal operation cycles) is greater than the distance between a normal cycle and its neighboring training cycles. In addition, a step-wise fault diagnosis is proposed to identify the root cause of a fault once it is detected. The proposed method not only shows superior fault detection performance compared with conventional SPM methods for both simulated faults and real faults from an industrial PSA process, but also correctly identifies the root causes of the faults.

In the third part of this work (Chapter 4), path analysis based on domain knowledge is proposed to examine whether hospitals that specialize in certain diseases achieve better results in terms of costs and patient outcomes. Using domain knowledge of the healthcare industry, I formulate hypotheses and construct paths. Pure data-driven ML approaches that do not start from hypotheses, such as multiple linear regression and partial least squares regression, can lead to incomplete conclusions because they consider only one path among all the possible paths. Path analysis, in contrast, includes all the possible paths through which hospital specialization can affect hospital performance, so the model can reveal the full effects of hospital specialization. The comparison between path analysis and the pure data-driven ML approaches suggests that domain knowledge can play a critical role in machine learning applications and should be incorporated whenever possible.

The contributions of this work and potential future work are summarized in Chapter 5. As demonstrated in this work, pure data-driven ML techniques have many limitations. For example, the results of pure data-driven ML tend to be sensitive to the training datasets, possibly leading to unreasonable conclusions. The new feature selection and feature engineering techniques proposed in this work can improve the robustness and reliability of ML by reducing the influence of noise and disturbances and by better capturing process characteristics. In addition, pure data-driven ML may fail to reveal the comprehensive effect of explanatory variables on a response variable. This work investigates the role of domain knowledge and how it enables a knowledge-guided model structure to be established. Such a structure would include all logical paths that explain the various influences of explanatory variables on a response variable, leading to complete and mechanistically interpretable results.
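
The FSM-kNN idea described in the abstract (a faulty cycle lies farther from its neighboring normal training cycles than a normal cycle does) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The four synthetic per-cycle features, the choice of k = 3, the sum of squared distances as the monitoring statistic, and the 99% empirical quantile used as the control limit are all assumptions made here for illustration.

```python
import numpy as np


def knn_statistic(reference, queries, k, skip_self=False):
    """Sum of squared distances from each query row to its k nearest reference rows."""
    diff = queries[:, None, :] - reference[None, :, :]
    sq_dist = np.sort(np.sum(diff ** 2, axis=2), axis=1)
    # When the queries are the reference rows themselves, column 0 holds the
    # zero self-distance, so it is skipped.
    first = 1 if skip_self else 0
    return sq_dist[:, first:first + k].sum(axis=1)


def fit_detector(normal_cycles, k=3, quantile=0.99):
    """Scale the training features and set a control limit from normal cycles only."""
    mean = normal_cycles.mean(axis=0)
    std = normal_cycles.std(axis=0)
    scaled = (normal_cycles - mean) / std
    train_stat = knn_statistic(scaled, scaled, k, skip_self=True)
    # Assumed rule: the limit is an empirical quantile of the training statistic.
    limit = np.quantile(train_stat, quantile)
    return {"mean": mean, "std": std, "scaled": scaled, "k": k, "limit": limit}


def detect(detector, cycles):
    """Return True for each cycle whose kNN statistic exceeds the control limit."""
    scaled = (cycles - detector["mean"]) / detector["std"]
    stat = knn_statistic(detector["scaled"], scaled, detector["k"])
    return stat > detector["limit"]


# Synthetic stand-in for per-cycle features (hypothetical, not PSA data):
# 200 normal cycles, 4 features each.
rng = np.random.default_rng(0)
normal_cycles = rng.normal(
    loc=[5.0, 1.0, 20.0, 0.5], scale=[0.5, 0.1, 2.0, 0.05], size=(200, 4)
)
detector = fit_detector(normal_cycles)

# One cycle at the center of normal operation, and one with the third
# feature shifted by ten standard deviations to mimic a fault.
typical_cycle = normal_cycles.mean(axis=0)
faulty_cycle = typical_cycle + np.array([0.0, 0.0, 10 * normal_cycles[:, 2].std(), 0.0])
flags = detect(detector, np.vstack([typical_cycle, faulty_cycle]))
print(flags)
```

In this sketch the cycle at the center of the training data should stay below the limit while the shifted cycle should exceed it. Because the statistic is computed on one feature vector per cycle, cycles of unequal duration need no trajectory alignment before comparison, provided the same features can be extracted from each.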