Automatic Speech Disorder Assessment for Children’s Speech Disorder
Date: 2025-05-08
Type of Degree: PhD Dissertation
Department: Computer Science and Software Engineering
Restriction Status: EMBARGOED
Restriction Type: Auburn University Users
Date Available: 05-08-2026

Abstract
Speech disorders in children present persistent challenges for early detection and intervention due to the complex, variable, and context-dependent nature of developing speech. Traditional automatic speech disorder detection (ASDD) systems, which rely heavily on handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCCs), often struggle to capture the nuanced articulatory and prosodic patterns that characterize pediatric speech impairments. Recent advances in transformer-based deep learning architectures and self-supervised learning (SSL) offer promising alternatives for building more robust and interpretable ASDD systems. This dissertation investigates three complementary approaches to advancing ASDD through the integration of modern representation learning techniques.

The first study examines the use of the Vision Transformer (ViT) architecture applied to MFCC features for the classification of disordered and non-disordered child speech. By leveraging the ViT’s patch-based attention mechanism, the study demonstrates that transformer-based models can achieve improved performance over conventional machine learning classifiers when applied to fixed acoustic feature representations.

The second study evaluates the effectiveness of SSL-based speech representations, specifically those derived from wav2vec 2.0 and HuBERT, in detecting speech disorders in children. Through layer-wise analysis and speaker-independent classification experiments, this study confirms that SSL representations outperform MFCCs by capturing more detailed, context-aware acoustic cues.

The third study explores an SSL-based perceptual similarity framework for measuring acoustic distances between speech samples. Using dynamic time warping (DTW) in the high-dimensional embedding space produced by SSL models, the study calculates similarity scores between utterances without relying on textual transcriptions.
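The dissertation's exact similarity pipeline is not reproduced here, but the core idea of the third study can be illustrated with a minimal NumPy sketch: run DTW over two sequences of frame-level embeddings (standing in for wav2vec 2.0 or HuBERT features) and report a path-normalized distance. The function name and the Euclidean frame cost are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Path-normalized DTW distance between two embedding sequences.

    seq_a: (T_a, D) array of frame embeddings (e.g., SSL-model outputs).
    seq_b: (T_b, D) array. Uses a Euclidean frame-to-frame cost, which is
    an illustrative choice; the actual cost function may differ.
    """
    ta, tb = len(seq_a), len(seq_b)
    # Pairwise Euclidean cost between every frame of seq_a and seq_b.
    cost = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=-1)
    # Accumulated-cost table with an infinity border for the base case.
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # skip a frame in seq_a
                acc[i, j - 1],      # skip a frame in seq_b
                acc[i - 1, j - 1],  # align the two frames
            )
    # Normalize by an upper bound on path length so utterances of
    # different durations remain comparable.
    return acc[ta, tb] / (ta + tb)
```

Because the distance is computed purely in embedding space, no transcription is needed: two renditions of the same word by different speakers can be compared directly, frame by frame.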
These distance metrics are shown to correlate strongly with clinical judgments of speech pronunciation accuracy and disorder severity, supporting their potential use in continuous monitoring or pre-diagnostic screening. Together, these studies provide a comprehensive evaluation of transformer-based and SSL-driven approaches for pediatric ASDD. The results highlight the advantages of using deep contextualized speech representations in terms of classification accuracy, robustness, and interpretability. The contributions offer a foundation for developing clinically viable tools to support early identification and longitudinal assessment of speech sound disorders in children.
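A correlation of this kind can be checked with a rank statistic such as Spearman's rho, since clinical severity ratings are ordinal. The sketch below uses entirely hypothetical numbers (the dissertation's data are not reproduced here) and a tie-free rank implementation for brevity.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Simplified sketch -- assumes no tied values in x or y.
    """
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical example: DTW distances from each child's utterance to a
# typical-speech reference, paired with clinician severity ratings
# (1 = mild ... 5 = severe). These values are made up for illustration.
distances = np.array([0.12, 0.45, 0.30, 0.80, 0.55])
ratings = np.array([1, 3, 2, 5, 4])
rho = spearman_rho(distances, ratings)
```

A rho near 1.0 would indicate that larger embedding-space distances track higher clinician-rated severity, the kind of agreement the abstract reports.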