Enhancing Healthcare Operations Through Generative AI: Contact Center Intelligence and Clinical Note Quality Assessment
Date
2026-04-19
Type of Degree
PhD Dissertation
Department
Computer Science and Software Engineering
Restriction Status
EMBARGOED
Restriction Type
Full
Date Available
04-20-2031
Abstract
This dissertation presents two research contributions at the frontier of AI-driven healthcare documentation. The first contribution is an end-to-end AI system for healthcare contact center operations that integrates automatic speech recognition, large language model-based call summarization, custom domain-specific sentiment analysis, and transformer-based topic labeling into a unified pipeline. The system was designed to address the operational limitations of asynchronous, sample-based quality monitoring in Medicare Advantage contact centers, where the quality of member interactions directly influences CMS Stars Ratings and plan performance. A functional software prototype was developed and demonstrated through a dashboard interface supporting sentiment-driven triage, granular call search and filtering, and structured call review via a Call Information Card displaying SOAP-format summaries with sentence-level sentiment annotations. The sentiment analysis component was built on gold-standard labels derived from post-call member satisfaction surveys through a four-item aggregation methodology, validated by ordinal logistic regression analysis confirming negligible practical influence of call disposition on survey completion. A systematic benchmarking study evaluated four sentence embedding models (mxbai-embed-large-v1, all-MiniLM-L12-v2, all-mpnet-base-v2, and stella_en_400M_v5) combined with five classification algorithms, identifying the stella_en_400M_v5 embedding with LightGBM as the optimal combination with an AUC-ROC of 0.81.

The second contribution is SOAP-QualiVal, a multi-agent, rubric-grounded evaluation framework for AI-generated SOAP clinical notes that addresses a previously uncharacterized challenge termed the evaluator quality paradox: the finding that conventional multi-attribute utility theory scoring systematically advantages lenient LLM evaluators that detect fewer errors over stringent evaluators that detect more. SOAP-QualiVal employs four specialized evaluation agents assessing structural completeness, transcript-grounded factual accuracy, safety and actionability, and clinical usability, whose outputs are aggregated through hierarchical MAUT with fuzzy confidence propagation. The Advanced Evaluator Score (AES) resolves the paradox by ranking evaluators on intrinsic error detection metrics combined via TOPSIS multi-criteria optimization. Benchmarking on 207 aci-bench encounters using nine LLM evaluator models under two embedding configurations demonstrated the paradox empirically: GPT-5.1 ranked last (9th) under conventional utility scoring but first under AES, with a rank reversal of eight positions and a 123-fold difference in safety-critical error detection at deployment scale (ρ = +0.982 between AES and error detection; ρ = −0.619 between conventional utility and error detection). Bootstrap stability analysis confirmed perfect rank-lock for the top three evaluators across 1,000 resampling iterations.

Together, these two contributions advance the responsible deployment of generative AI in high-stakes healthcare documentation by demonstrating both how to build reliable AI documentation systems and how to evaluate them rigorously at scale.
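The sentence-level sentiment component described above pairs a pretrained sentence-embedding model with a gradient-boosted classifier, with stella_en_400M_v5 plus LightGBM reported as the best combination (AUC-ROC 0.81). The sketch below illustrates that general embedding-plus-classifier benchmarking pattern only; the model identifier, binary label construction, and hyperparameters are illustrative assumptions, not the dissertation's actual configuration.

# Minimal sketch of the embedding + LightGBM benchmarking step, under assumed
# model and label choices (not the dissertation's exact setup).
import numpy as np
from sentence_transformers import SentenceTransformer
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def benchmark_sentiment_classifier(sentences, labels,
                                   model_name="sentence-transformers/all-mpnet-base-v2"):
    """Embed call sentences, fit a LightGBM classifier, and report AUC-ROC."""
    encoder = SentenceTransformer(model_name)
    X = encoder.encode(sentences, show_progress_bar=False)  # (n, dim) sentence embeddings
    y = np.asarray(labels)                                   # assumed binary: 1 = negative sentiment
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]                   # probability of the negative class
    return roc_auc_score(y_te, scores)

Swapping model_name for another embedding (e.g. a stella_en_400M_v5 checkpoint) and looping over classifiers would reproduce the shape of the four-embedding-by-five-classifier comparison described in the abstract.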

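The Advanced Evaluator Score ranks evaluators on intrinsic error-detection metrics combined via TOPSIS rather than on conventional utility scores. The following is a minimal TOPSIS sketch in that spirit, assuming a small hypothetical decision matrix; the criteria, weights, and values are made up for illustration and are not taken from the dissertation's results.

# Hedged TOPSIS sketch: rank evaluators (rows) on error-detection criteria (columns).
import numpy as np

def topsis_rank(matrix, weights, benefit):
    """Rank alternatives by closeness to the ideal solution (standard TOPSIS)."""
    M = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    norm = M / np.linalg.norm(M, axis=0)           # vector-normalize each criterion
    V = norm * w                                    # weighted normalized matrix
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    closeness = d_neg / (d_pos + d_neg)             # 1.0 = ideal evaluator
    return np.argsort(-closeness), closeness

# Toy example: three evaluators scored on [safety-error recall, factual-error
# recall, false-positive rate]; the last criterion is a cost (smaller is better).
ranking, scores = topsis_rank(
    matrix=[[0.90, 0.85, 0.10],
            [0.60, 0.70, 0.05],
            [0.30, 0.40, 0.02]],
    weights=[0.5, 0.3, 0.2],
    benefit=[True, True, False],
)

Under this kind of ranking, a stringent evaluator that detects many errors rises to the top even if a utility-style aggregate would have penalized it, which is the mechanism by which AES resolves the evaluator quality paradox described in the abstract.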