This Is Auburn: Electronic Theses and Dissertations

Quality-Aware Data Crowdsourcing and Federated Learning in Wireless Networks

Date

2022-11-21

Author

Zhao, Yuxi

Type of Degree

PhD Dissertation

Department

Electrical and Computer Engineering

Abstract

Data crowdsourcing (referred to as "crowdsourcing" for brevity) has found a wide range of applications. In principle, crowdsourcing leverages the "wisdom" of a potentially large crowd of workers (e.g., mobile users) to carry out tasks. One main advantage of crowdsourcing is that it can exploit the diversity of inherently inaccurate data from many workers: by aggregating the data obtained from the crowd, the data accuracy (referred to as "data quality") after aggregation can improve substantially. Quality-aware crowdsourcing is beneficial because it uses workers' data quality to perform task allocation and data aggregation. However, a worker's quality and data can be her private information, which she may have an incentive to misreport to the crowdsourcing requester. Moreover, a worker's quality and data can depend on her sensitive information (e.g., location), which an adversary can infer from the outcomes of task allocation and data aggregation. In addition, crowdsourcing is vulnerable to data poisoning attacks, in which an attacker reports malicious data to reduce the accuracy of the aggregated data. We study privacy-preserving crowdsourcing mechanisms for truthful data quality elicitation, and malicious data attacks on dynamic crowdsourcing.
In federated learning (FL), machine learning (ML) models are trained in a distributed manner on edge devices, without transmitting data samples from a large number of devices. In such a setting, the quality of a local model update is closely tied to the variance of the local stochastic gradient, which depends on the mini-batch size used to compute the update. Wireless federated learning (WFL) can achieve collaborative intelligence in wireless edge networks. It is widely held that WFL can support intelligent control and management of wireless communications and networks, and can enable many AI applications based on wireless networked systems.
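The aggregation benefit described for crowdsourcing above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that each worker reports the true value plus independent zero-mean Gaussian noise with a known standard deviation, and it uses inverse-variance weighting as one standard form of quality-aware aggregation. The noise model, the weighting rule, and all names below are assumptions made for this example, not the mechanisms developed in the dissertation.

```python
import numpy as np


def inverse_variance_weights(sigmas):
    """Weights proportional to 1 / sigma_i^2, normalized to sum to 1."""
    precisions = 1.0 / np.square(np.asarray(sigmas, dtype=float))
    return precisions / precisions.sum()


def aggregate(reports, weights):
    """Weighted average of the workers' reports."""
    return float(np.dot(weights, reports))


def aggregate_variance(sigmas, weights):
    """Variance of the weighted average for independent worker noise."""
    return float(np.sum(np.square(weights) * np.square(sigmas)))


# Hypothetical worker qualities: noise standard deviation per worker.
sigmas = np.array([0.5, 1.0, 2.0, 4.0])

# Simulated reports around an assumed ground truth.
rng = np.random.default_rng(0)
truth = 10.0
reports = truth + rng.normal(0.0, sigmas)

uniform = np.full(len(sigmas), 1.0 / len(sigmas))  # quality-unaware
quality = inverse_variance_weights(sigmas)         # quality-aware

var_best_single = float(np.min(np.square(sigmas)))
var_uniform = aggregate_variance(sigmas, uniform)
var_quality = aggregate_variance(sigmas, quality)

print("uniform estimate:      ", aggregate(reports, uniform))
print("quality-aware estimate:", aggregate(reports, quality))
print("variance (best single worker):", var_best_single)
print("variance (uniform average):   ", var_uniform)
print("variance (quality-aware):     ", var_quality)
```

Under these assumptions the quality-aware aggregate has lower variance than even the best single worker, whereas the uniform average can be worse than the best worker when qualities are very heterogeneous. This is one way to see why knowing (or truthfully eliciting) workers' quality matters.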
In distributed stochastic gradient descent, a typical method for FL, the convergence rate of the trained model depends heavily on which users participate in the learning process, given the heterogeneous quality of their local model updates and the unique features of wireless edge networks. The quality of a local parameter update is measured by the variance of the update, which is determined by the data sampling size (a.k.a. mini-batch size) used to compute it. Importantly, the quality of local updates can be treated as a design parameter and used as a control "knob" (via the mini-batch size) that is adapted across users and over time. Such quality-aware distributed computation can substantially improve the learning accuracy of FL. To achieve a desired tradeoff between learning accuracy and communication and computation costs, the devices participating in each round of FL and the quality of their local updates should be determined based on their impact on the eventual training loss, as well as on the devices' channel conditions and computation costs. We characterize performance bounds on the training loss as a function of the quality of local updates over the training process, for IID and non-IID data in the convex, non-convex, and asynchronous settings. Based on the insights revealed by these bounds, we develop cost-effective dynamic distributed learning algorithms that adaptively select participating users and their mini-batch sizes based on users' communication and computation costs.
In many applications of ML (e.g., image classification), the labels of training data need to be generated manually by human agents (e.g., by recognizing and annotating objects in an image), a process that is usually costly and error-prone. The labeling of training data can be seen as data crowdsourcing.
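The role of the mini-batch size as a quality "knob" for local updates, described above, can be checked on a toy problem. The sketch below assumes a synthetic least-squares task on a single device and mini-batches sampled with replacement, in which case the variance of the mini-batch gradient equals the per-sample gradient variance divided by the batch size. The data, model, and function names are hypothetical choices for illustration and are not taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic local dataset for one device (assumed for illustration).
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.5, size=n)
w = np.zeros(d)  # current model at which gradients are evaluated


def per_sample_gradients(w, X, y):
    """Gradient of 0.5 * (x_i^T w - y_i)^2 for every sample, shape (n, d)."""
    residual = X @ w - y
    return residual[:, None] * X


def minibatch_gradient_variance(grads, batch_size, trials, rng):
    """Empirical E||g_batch - g_full||^2 over random mini-batches
    drawn with replacement."""
    full = grads.mean(axis=0)
    idx = rng.integers(0, len(grads), size=(trials, batch_size))
    estimates = grads[idx].mean(axis=1)  # shape (trials, d)
    return float(np.mean(np.sum((estimates - full) ** 2, axis=1)))


grads = per_sample_gradients(w, X, y)
full_grad = grads.mean(axis=0)
per_sample_var = float(np.mean(np.sum((grads - full_grad) ** 2, axis=1)))

results = {}
for b in (1, 4, 16, 64):
    empirical = minibatch_gradient_variance(grads, b, trials=4000, rng=rng)
    theory = per_sample_var / b  # exact for sampling with replacement
    results[b] = (empirical, theory)
    print(f"batch size {b:3d}: empirical {empirical:10.4f}, theory {theory:10.4f}")
```

In this sketch, quadrupling the mini-batch size cuts the update variance by roughly a factor of four, at the price of four times as many per-sample gradient computations. That is the accuracy-versus-computation tradeoff the abstract refers to.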
Given the strategic behavior of clients, who may not exert the desired effort in local data labeling and local model computation (quantified by the mini-batch size used in the stochastic gradient computation) and may misreport their local models to the FL server, we characterize performance bounds on the training loss and devise mechanisms for eliciting labeling effort, computation effort, and local models. These mechanisms incentivize strategic clients to exert the effort desired by the server in local data labeling and local model computation, and to report their true local models to the server.