Studying the Applications of Probability Metrics and Divergence Measures in Solving Classic Control Tasks
Type of Degree: Master's Thesis
Computer Science and Software Engineering
Choosing the correct statistical distance for a machine learning problem is vital when estimating the degree of dissimilarity between two discrete distributions. In the distributional reinforcement learning problem, the distribution of returns that an agent can obtain is approximated across the entire state space. To describe the expected behavior of the agent as it interacts with the environment in the distributional setting, the Wasserstein distance was initially proposed for the C51 algorithm because of the convergence guarantees it offers for the policy evaluation problem. However, because the Wasserstein distance produces biased sample gradients, the KL divergence was ultimately used as the categorical loss function in the C51 algorithm. In this thesis we studied two classes of statistical distances and empirically compared their performance as categorical loss functions in the C51 algorithm against that of the KL divergence. The first class consisted of probability metrics, such as the Sinkhorn divergence and the Energy distance, which attempt to alleviate the poor sample and computational complexity of the exact Wasserstein distance. The second class consisted of divergence measures that are instances of both the f-divergence and the α-divergence. We studied the training-time and testing-time performance of these variations on the Acrobot and Cartpole environments. We demonstrated that the statistical distances most suitable for approximating value distributions in these environments were divergence measures that possessed the zero-avoiding property or a combination of zero-avoiding and zero-forcing properties. Strictly zero-forcing divergence measures were unsuitable for use as a categorical loss function in these environments.
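The distinction between zero-avoiding and zero-forcing behavior can be illustrated with the two directions of the KL divergence on a small discrete example. The following is a minimal sketch and not code from the thesis: the function name `kl_divergence`, the three-atom distributions, and the `eps` clamp used to keep the logarithm finite are all assumptions made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions on a shared support.

    Terms with p[i] == 0 contribute zero; q is clamped at eps (an
    illustrative choice) so the result stays finite when q[i] == 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

# Hypothetical bimodal target and two candidate approximations.
target = np.array([0.5, 0.0, 0.5])
broad = np.array([0.4, 0.2, 0.4])      # covers both modes, leaks mass in between
narrow = np.array([0.98, 0.01, 0.01])  # commits almost entirely to one mode

# Forward KL(target || model) is zero-avoiding: it penalizes a model that
# puts little mass where the target has mass, so it prefers `broad`.
forward_broad = kl_divergence(target, broad)
forward_narrow = kl_divergence(target, narrow)

# Reverse KL(model || target) is zero-forcing: it penalizes a model that
# puts mass where the target has none, so it prefers `narrow`.
reverse_broad = kl_divergence(broad, target)
reverse_narrow = kl_divergence(narrow, target)

print(forward_broad < forward_narrow)   # forward KL favors the broad model
print(reverse_narrow < reverse_broad)   # reverse KL favors the narrow model
```

In this toy setting the zero-avoiding direction rewards covering every atom that carries target mass, while the zero-forcing direction rewards collapsing onto a single mode.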
The Sinkhorn divergence was ill-suited to serve as a categorical loss function, whereas the Energy distance showed evidence of learning in these environments, although its training performance fell well short of that of the more successful divergence measures. This indicated that if an optimal-transport-based categorical loss function were to be used in the C51 algorithm, maximal entropic regularization would have to be applied.
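For readers unfamiliar with the Energy distance, the sketch below computes its squared form between two categorical distributions that share a fixed set of atoms, as in a C51-style parameterization. It is an illustration under stated assumptions rather than the thesis implementation: the 51-atom support on [-10, 10], the Gaussian-shaped example distributions, and the function name `energy_distance_sq` are hypothetical.

```python
import numpy as np

def energy_distance_sq(p, q, atoms):
    """Squared energy distance between categorical distributions p and q.

    Uses 2 E|X - Y| - E|X - X'| - E|Y - Y'|, which for a shared support
    reduces to -(p - q)^T M (p - q) with M[i, j] = |atoms[i] - atoms[j]|.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    atoms = np.asarray(atoms, dtype=float)
    pairwise = np.abs(atoms[:, None] - atoms[None, :])
    diff = p - q
    return float(-diff @ pairwise @ diff)

# Hypothetical fixed support with 51 atoms (values chosen for illustration).
atoms = np.linspace(-10.0, 10.0, 51)

def discretized_gaussian(mean, std):
    # Normalized Gaussian-shaped weights over the fixed atoms.
    weights = np.exp(-0.5 * ((atoms - mean) / std) ** 2)
    return weights / weights.sum()

p = discretized_gaussian(mean=-2.0, std=1.5)
q_near = discretized_gaussian(mean=-1.0, std=1.5)
q_far = discretized_gaussian(mean=5.0, std=1.5)

d_near = energy_distance_sq(p, q_near, atoms)
d_far = energy_distance_sq(p, q_far, atoms)
print(d_near < d_far)  # moving mass farther along the support costs more
```

Unlike the KL divergence, this quantity depends on the geometry of the support, so shifting probability mass to a distant atom is penalized more than shifting it to a neighboring one.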