Advanced Clustering Methods and Applications for Data Visualization
Type of Degree: PhD Dissertation
Electrical and Computer Engineering
Categorizing data without any label information is one of the most important tasks in preprocessing raw data. An unsupervised learning technique called clustering is used for this purpose. Generally, data are clustered based on their similarities or dissimilarities. However, in the big data era, conventional clustering algorithms turn out to run slowly or to yield low clustering accuracy. This work presents three clustering methods that accelerate the clustering process and improve clustering accuracy. Grid-based clustering is designed to process large amounts of data quickly. Combined with density-based clustering, the clustering method based on grid and density is capable of categorizing data with almost linear time complexity. Inspired by natural mountain ridges, the clustering method by finding data mountain ridges treats the data layout as a set of mountain ridges, where each ridge stands for one cluster. Fuzzy assignment is adopted to calculate data density, which serves as the data mountain height. The method can find the desired clusters without being given the number of clusters. More importantly, it is capable of clustering data with complex shapes and noise. Partitional clustering categorizes data into many microclusters. Once these microclusters are known, merging them into larger clusters can provide highly accurate clustering results. The clustering method by analyzing density consistency and the minimum internal and external distance ratio takes on the challenge of developing a strategy to determine whether or not to merge (agglomerate) microclusters, so that highly accurate clustering results are obtained. In addition, the demand for data visualization is increasing in the big data era. Because high-dimensional data cannot be viewed directly by human beings, various dimension reduction approaches are explored to map (embed) these data onto a two-dimensional plane or into a three-dimensional space so that people can visualize them.
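As a rough illustration of how a grid combined with a density test can keep clustering close to linear time, the following minimal sketch bins points into cells in a single pass, keeps the cells that hold enough points, and joins neighbouring dense cells into clusters. It is not the dissertation's algorithm: the function name `grid_density_cluster` and the parameters `cell_size` and `min_points` are hypothetical, chosen only for this example.

```python
from collections import deque
from itertools import product
from math import floor


def grid_density_cluster(points, cell_size, min_points):
    """Label each point with a cluster id; -1 marks points in sparse cells.

    Hypothetical sketch: cell_size and min_points are illustrative
    parameters, not values or names taken from the dissertation.
    """
    # One pass over the data: assign every point to an integer grid cell.
    members = {}
    for idx, p in enumerate(points):
        cell = tuple(floor(x / cell_size) for x in p)
        members.setdefault(cell, []).append(idx)

    # A cell is "dense" when it holds at least min_points points.
    dense = {c for c, m in members.items() if len(m) >= min_points}

    # Offsets to every adjacent cell (all of {-1, 0, 1}^dim except zero).
    dim = len(points[0]) if points else 0
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]

    # Breadth-first search joins adjacent dense cells into one cluster.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for off in offsets:
                nb = tuple(c + o for c, o in zip(cell, off))
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_label
                    queue.append(nb)
        next_label += 1

    # Points inherit the label of their cell; sparse cells stay at -1.
    labels = [-1] * len(points)
    for cell, lab in cell_label.items():
        for idx in members[cell]:
            labels[idx] = lab
    return labels
```

In this sketch the binning step touches each point once, and the search visits each dense cell once, so the cost grows roughly linearly with the number of points for a fixed dimension. The number of neighbouring cells grows as 3^d with the dimension d, so a plain grid like this one is only practical for low-dimensional data.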
Multi-Dimensional Scaling (MDS) is a set of methods for dimension reduction. Nevertheless, it suffers from massive data overlap and does not reveal the relations between data. This work therefore also presents a visualization algorithm combined with clustering. The resulting figures show larger margins between clusters and the connection relations between data. In conclusion, the accomplishments of this work are as follows: (1) clustering time performance is boosted by applying the grid technique; (2) data with complex shapes and noise can be clustered by finding data mountain ridges; (3) high clustering accuracy is achieved by analyzing data density and the minimum internal and external distance ratio; (4) a more accurate MDS solution is obtained by applying the LM algorithm; (5) cluster separation regions are enlarged in the embedding results using density concentration; (6) high-dimensional data relations are revealed and illustrated in the embedding results. Experiments on clustering and visualizing synthetic and real-world datasets were conducted to verify the effectiveness of the clustering and visualization methods introduced above.
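To make accomplishment (4) more concrete, the sketch below fits a low-dimensional embedding to a distance matrix by minimizing the raw MDS stress (the sum of squared differences between embedded and target distances) with a damped Gauss-Newton iteration. It assumes that "LM" stands for Levenberg-Marquardt, which the abstract does not spell out, and the objective, damping schedule, and the name `mds_lm` are illustrative choices rather than the dissertation's actual formulation.

```python
import numpy as np


def mds_lm(D, out_dim=2, iters=100, seed=0):
    """Embed n items in out_dim dimensions from an (n, n) distance matrix D.

    Hypothetical sketch of Levenberg-Marquardt applied to raw MDS stress;
    returns the embedding and the final stress value.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    i, j = np.triu_indices(n, k=1)        # every unordered pair (i < j)
    rows = np.arange(len(i))
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, out_dim))     # random starting layout
    mu = 1e-2                             # damping factor

    def residuals_and_jacobian(Y):
        diff = Y[i] - Y[j]
        dist = np.linalg.norm(diff, axis=1)
        r = dist - D[i, j]                # one residual per pair
        unit = diff / np.maximum(dist, 1e-12)[:, None]
        J = np.zeros((len(i), n * out_dim))
        for k in range(out_dim):
            J[rows, i * out_dim + k] = unit[:, k]
            J[rows, j * out_dim + k] = -unit[:, k]
        return r, J

    r, J = residuals_and_jacobian(Y)
    stress = r @ r
    for _ in range(iters):
        # Damped normal equations: (J^T J + mu I) step = -J^T r
        H = J.T @ J
        step = np.linalg.solve(H + mu * np.eye(H.shape[0]), -(J.T @ r))
        Y_new = Y + step.reshape(n, out_dim)
        r_new, J_new = residuals_and_jacobian(Y_new)
        stress_new = r_new @ r_new
        if stress_new < stress:
            # Accept the step and trust the Gauss-Newton model more.
            Y, r, J, stress = Y_new, r_new, J_new, stress_new
            mu = max(mu / 10, 1e-12)
        else:
            # Reject the step and move toward gradient descent.
            mu = min(mu * 10, 1e12)
    return Y, float(stress)
```

Because a step is accepted only when it lowers the stress, the returned stress is never higher than that of the starting layout. Like any local method, the sketch can still stop in a local minimum, so the result depends on the random initialization.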