Distributed Listening in Automatic Speech Recognition

by

Yolanda McMillian

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 9, 2010

Keywords: Automatic Speech Recognition, Spoken Language Systems, Distributed Listening

Approved by

Juan E. Gilbert, Chair, Professor of Computer Science and Software Engineering
Cheryl Seals, Associate Professor of Computer Science and Software Engineering
Gerry Dozier, Professor of Computer Science and Software Engineering

Abstract

While speech recognition systems have come a long way in the last forty years, there is still room for improvement. Although readily available, these systems are sometimes inaccurate and insufficient. The research presented here outlines a technique called Distributed Listening, which demonstrates noticeable improvements over existing speech recognition methods. The Distributed Listening architecture introduces the idea of multiple, parallel, yet physically separate automatic speech recognizers called listeners. Distributed Listening also uses a piece of middleware, called an interpreter, which resolves multiple interpretations using a phrase resolution algorithm. The experiments in this research show that these components work together to increase the accuracy of the transcription of spoken utterances and that Distributed Listening is, at worst, as good as the best individual listener.

Acknowledgments

I want to thank my family and friends who provided unending support and encouragement through this entire process. Specifically, I thank my mother, Eloise McMillian; my sisters, Shauna and Tonya McMillian; my father, Joseph McMillian, who lives on in the loving memory of his family; and my support circle, Hilary Boyd, Regina Bolden, Nicole Harris, Cynithia Landry, Tamika Austin, Kenitra Fewell, and Shalonna Banks. I especially want to thank my advisor, Dr.
Juan Gilbert, for his patience and mentoring. I would also like to thank my committee members, Dr. Gerry Dozier and Dr. Cheryl Seals, along with my outside reader, Dr. Jared Russell, for their assistance in making this possible. I also must thank the members of the Human Centered Computing Lab, both past and present, for their unconditional help and advice, especially Jerome McClendon, Philicity Williams, Dr. Ken Rouse, Dr. E. Vincent Cross, Kamilah Walker, Michele Williams, Kenishia Sapp, and Dr. Dale-Marie Wilson. Most of all, I give thanks to my Lord and Savior Jesus Christ, because through Him, I can do anything but fail.

Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Problem Description
1.3 Overview of Research Goals, Approaches and Contributions
1.4 Organization
2 Literature Review
2.1 Automatic Speech Recognition (ASR) Systems
2.2 Enhanced Majority Rules
2.3 Virtual Intelligent Codriver
2.4 Recognized Output Voting Error Reduction
2.5 Modified ROVER
2.6 Post-Labeling Integration
2.7 Multiple Japanese LVCSR Models
2.8 Segmental Minimum Bayes-Risk Recognition
2.9 Posterior Probability Decoding with Confidence Estimation
2.10 Adaptive Language Models
3 System Design
3.1 Problem Statement
3.2 Design Principles
3.3 System Features
3.3.1 Listeners
3.3.2 Interpreters
4 Experiment Design
4.1 Internal Review Board
4.2 Data
4.3 Environment
4.4 Materials
4.4.1 Listeners
4.4.2 Dragon NaturallySpeaking 10
4.4.3 Corpus
4.5 Procedure
4.5.1 Stereo Mix Option
4.5.2 External Speaker and External Microphone Option
4.5.3 3.5mm Cable Option
5 Research Findings
5.1 Stereo Mix Option
5.2 External Speaker and External Microphone Option
5.3 3.5mm Cable Option
5.4 n-Listeners Combinations
5.4.1 Stereo/Speaker Combination
5.4.2 Stereo/Cable Combination
5.4.3 9-Listener Combination
5.5 Discussion
6 Conclusion
6.1 Contributions
6.2 Directions for Future Research
6.2.1 Architectures
6.2.2 Parallel Processing
6.3 Summary
References
Appendix A: Recognition Engines Results - Stereo Mix
Appendix B: Recognition Engines Results - External Speaker/Microphone
Appendix C: Recognition Engines Results - 3.5mm Auxiliary Cable
Appendix D: Internal Review Board Approval

List of Tables

Table 1 Barry et al. (1994) Experiment 1 Vocabulary
Table 2 Barry et al. (1994) Experiment 2 Vocabulary
Table 3 Modified ROVER Word Error Rates, Sentence Error Rates, and Perplexity
Table 4 Modified ROVER Word Error Rates
Table 5 Duchnowski (1993) Breakdown of TIMIT Database
Table 6 Adaptive Language Models Word Error Rates
Table 7 PRA - DL Utterances Corpus
Table 8 PRA Evidence Bigrams
Table 9 PRA Local Bigrams and Corresponding Frequency Counts
Table 10 Listener 1 Set of Candidate Phrases and Frequency Counts
Table 11 Listener 2 Set of Candidate Phrases and Frequency Counts
Table 12 Listener 3 Set of Candidate Phrases and Frequency Counts
Table 13 Master Interpreter Set of PRA Candidate Phrases
Table 14 PRA Most Likely Spoken Phrase
Table 15 Actual Spoken Phrases
Table 16 Stereo Mix Option Recognition Accuracy Rates
Table 17 Correct Semantic Interpretations of the Actual Spoken Phrase
Table 18 Agreement Between Listeners on Incorrect Recognitions
Table 19 Stereo Mix Word Error Rates
Table 20 External Speaker/Microphone Option Recognition Accuracy Rates
Table 21 External Speaker/Microphone Inconclusive Round
Table 22 External Speaker/Microphone Word Error Rates
Table 23 3.5mm Cable Option Accuracy Rates
Table 24 3.5mm Cable Correct Semantic Interpretation of the Actual Spoken Phrase
Table 25 3.5mm Cable Word Error Rates
Table 26 Stereo/Speaker Combination Recognition Accuracy Rates
Table 27 Stereo/Speaker Combination Word Error Rates
Table 28 Stereo/Cable Combination Accuracy Rates
Table 29 Stereo/Cable Combination Round 1
Table 30 Stereo/Cable Combination Word Error Rates
Table 31 Stereo/Speaker/Cable Combination Recognition Accuracy Rates
Table 32 Stereo/Speaker/Cable Combination Round 1
Table 33 Stereo/Speaker/Cable Combination Word Error Rates

List of Figures

Figure 1 Distributed Listening Architecture
Figure 2 Barry et al. (1994) Hardware Configuration
Figure 3 Barry et al. (1994) Adjusted Overall Accuracy Measure
Figure 4 Brutti et al. (2004) Distributed Listening Architecture
Figure 5 Brutti et al. (2004) Sample Dialogue
Figure 6 ROVER Architecture
Figure 7 Modified and Standard ROVER Example WTN
Figure 8 Duchnowski Block Diagram of the Proposed Recognizer
Figure 9 Multiple Japanese LVCSR Models Recall and Precision Formulas
Figure 10 e-ROVER Joining Two Correspondence Sets
Figure 11 e-ROVER WTN Construction
Figure 12 N-Best ROVER and e-ROVER Comparative Analysis
Figure 13 Posterior Probability Example Confusion Network
Figure 14 Example Bigrams
Figure 15 Bigrams ER Diagram
Figure 16 Evidence-Based Phrase Resolution Bigram Creation
Figure 17 PRA Round 1 Valid Bigram Creation and Local Frequency Count
Figure 18 PRA Round 2 Valid Bigram Creation and Local Frequency Count
Figure 19 PRA Rounds 3 and 4 Valid Bigram Creation and Local Frequency Count
Figure 20 PRA Rounds 5 and 6 Valid Bigram Creation and Local Frequency Count
Figure 21 PRA Rounds 7 and 8 Bigram Creation and Local Frequency Count
Figure 22 Listener 1-Round 1 PRA Iteration
Figure 24 Listener 1-Round 2 PRA Iteration
Figure 25 Listener 1-Round 2 Candidate Phrase Concatenations
Figure 26 Listener 1-Round 3 PRA Iterations
Figure 27 Listener 1-Round 3 PRA Concatenations
Figure 28 Listener 1-Round 4 PRA Iteration and Concatenation
Figure 29 Listener 1-Rounds 5 and 6 PRA Iterations
Figure 30 Listener 1-Round 7 PRA Iteration
Figure 31 Listener 1-Round 7 PRA Concatenations
Figure 32 Listener 1-Round 8 PRA Iteration
Figure 33 Listener 1-Round 8 PRA Concatenations
Figure 33 External Speaker/Microphone Experiment Setup
Figure 34 3.5mm Cable Experiment Setup
Figure 35 Hybrid Distributed Listening Architecture

1 Introduction

1.1 Motivation

Research in the area of speech and natural language processing has been ongoing for over forty years (Natural Language Software Registry 2004; Jurafsky 2000), with foundations in a number of overlapping disciplines (Jurafsky 2000); however, there is still room for improvement in mainstream speech recognition systems (Deng and Huang 2004). Spoken language is quite pervasive, which leads to frustration when spoken language systems do not meet a user's expectations. In theory, these systems have the ability to save both time and money, all while executing on a consistent basis, something humans are not easily able to do. For example, an automated system can respond "...tirelessly, patiently, perkily, consistently, and to the best of her abilities" time after time (Price 2010). Yet, the annoyance of such systems to a user far outweighs these advantages. Until these systems outperform a typical person in most areas, there will always be room for improvement. Additionally, until systems achieve a conversational and casual style of speech interaction, the challenges for current speech recognition technology will persist (Deng and Huang 2004).
A final fundamental obstacle for mainstream spoken language systems is achieving acceptable accuracy rates in noisy environments. Although strides have been made, there are still practical limitations that need to be addressed (Deng and Huang 2004). Since ASR systems have the ability to be superior to people with regard to consistency and data management, research on these systems is unending.

Distributed Listening will further research in this area. The concept is based around the idea of multiple speech input sources. Previous research activities involved a single microphone with multiple, separate recognizers that all yielded improvements in accuracy. Distributed Listening uses multiple, parallel speech recognizers, with each recognizer having its own input source. Each recognizer is known as a listener and works in parallel with the other listeners. Each listener also serves as an interpreter. Once input is collected from the listeners, one machine, the master interpreter, processes all of the input (see Figure 1).

Figure 1 Distributed Listening Architecture

To process the spoken input, a phrase resolution algorithm is used. This approach is analogous to a crime scene with multiple witnesses (the listeners) and a detective (the interpreter) who pieces together the stories of the witnesses, using his or her knowledge of crime scenes, to form a hypothesis of the actual event. Each witness will have a portion of the story that is the same as that of the other witnesses. It is up to the detective to fill in the blanks. With Distributed Listening, the process is very similar. Each listener will have common recognition results, and the individual interpreters will use a phrase resolution algorithm to propose phrases, with the master interpreter resolving conflicts. All domains that utilize spoken language systems can benefit from Distributed Listening.
The increase in speech recognition accuracy will result in more effective communication by improving the automatic transcription of spoken words.

1.2 Problem Description

Distributed Listening uses multiple perspectives collected from distributed speech recognizers working in parallel. Distributed Listening also uses a phrase resolution algorithm to reconcile the results of each recognizer. This approach addresses the following issues found with current speech recognition systems:

1. Less than favorable recognition results in sub-optimal environments. This includes environments with considerable background noise and environments where the system has not been trained for a specific individual.
2. Poor recognition accuracy due to distorted input. Current systems use the same input source for each recognizer, so poor input will result in undesirable recognition rates.

1.3 Overview of Research Goals, Approaches and Contributions

Distributed Listening answers the question, "Is there a way to improve the accuracy of current speech recognition systems?" Researchers found that speech recognition accuracy rates fall to 0% when grammars reach a certain size (Gilbert 2003). While current speech recognition systems are robust and usable, in certain domains that is not enough. For example, students with hearing impairments have enough trouble keeping pace with the other students in the classroom. The greater the accuracy of the speech recognition system, the better it is for these students.

To enhance speech recognition systems, Distributed Listening aims to simulate the way people hear. Humans use a psychological method called Dichotic Listening, where people listen to different voices in each ear at the same time (Bruder 2004). It is a natural extension to enable systems to hear in a manner similar to people. The success of Distributed Listening will benefit not only the field of computer science, but society in general.
A more accurate speech recognition system can be applied to all domains that utilize spoken input, resulting in a more precise form of communication.

1.4 Organization

The following chapters discuss this research task in detail. Chapter 2 examines the area of research that supports the development of Distributed Listening: Automatic Speech Recognition. Chapter 3 presents the research question and the approach that was used to support the hypothesis, including the system design and implementation details. Chapter 4 explains the experiment that was performed, with a focus on the experiment design and settings. Chapter 5 details the results, including a comprehensive data analysis and discussion. Chapter 6 provides the summary, along with contributions and directions for future research. The document ends with a reference list and appendices.

2 Literature Review

This chapter provides a description of Automatic Speech Recognition (ASR) systems, which are the core and concentration of this body of work. The scope of ASR systems is far-reaching, yet the research presented here is focused on ASR systems that use multiple speech recognizers. Thus, this chapter also provides a review of such systems, which are the fundamental focus of this project.

2.1 Automatic Speech Recognition (ASR) Systems

Automatic speech recognition systems convert a speech signal into a sequence of words, usually based on the Hidden Markov Model (HMM) (Young 1990), in which words are constructed from a sequence of states (Baum 1972, Furui 2002), based on an extensive vocabulary of training data (Price 2010). The speech signal itself is based on phonemes (Young 1989). The sequence of words produced from the speech signal is returned as transcribed text of the speech input. ASR systems do not include a component that determines the identity of the speaker, nor do ASR systems determine the meaning of the words that are spoken.
Such systems are known as speaker recognition/verification and natural language processing, respectively, and are separate entities from ASR systems. Among other things, ASR systems must overcome two obstacles (Price 2010):

1. Noise: The environment that surrounds the user directly impacts the accuracy of spoken language systems. In a controlled environment with minimal noise and competing speech, it is common to have accurate recognition results. This accuracy degrades as background noise is introduced.
2. Dialect: The dialect of the user also directly impacts the accuracy of spoken language systems. Since such a large training set is needed for ASR systems, speakers whose speech matches the training set will notice better accuracy than speakers who have a noticeably different dialect.

Even with the obstacles that affect ASR systems, there are four definitive advantages of such systems (Furui 1989):

1. Users do not need a specialized skill, like typing, to use speech recognition systems. For most people, speech is an inherent skill that comes naturally and is cultivated from an early age.
2. Using speech is significantly faster than other forms of communication like typing or writing. A user can communicate with speech up to 10 times faster than writing on paper.
3. ASR systems allow the use of multiple modalities. That is, users can speak while doing other activities with their hands, legs, eyes, or ears.
4. The input methods of automatic speech recognition systems are economical. Specifically, microphones and telephones are very affordable.

Given the inherent nature of speech and the pervasive and ubiquitous qualities of computers, it is not surprising that ASR systems are heavily researched as practical applications to supplement everyday life. Within this domain, a focus has been maintained on multiple ASR systems that work together to improve accuracy. A review of such systems is presented in the remainder of this chapter.
2.2 Enhanced Majority Rules

Barry et al. (1994) took three different Automatic Speech Recognition systems, along with an Enhanced Majority Rules (EMR) software algorithm and a fourth master system, to increase accuracy within the domain of aircraft cockpits, as shown in Figure 2.

Figure 2 Barry et al. (1994) Hardware Configuration

Each of the three individual systems received the same input, performed speech recognition, and sent the result to the master system. The result included the recognized word along with a distance score, as well as a second-choice word and its distance score. The distance score was the confidence of the system in choosing a particular word. The inconsistencies from the three individual systems were resolved using the EMR algorithm. The EMR algorithm resolved these inconsistencies by first looking for agreement among the individual systems on the recognized word. If there was no majority agreement, the EMR algorithm added the second-choice word, with equal weight, to the collection of words and looked for a majority agreement. If there was still no agreement, the algorithm relied on the distance scores. At times, the individual systems would produce extra recognized words, or insertions. In those cases, the insertions were also added to the collection of words, and the algorithm proceeded in the same manner as described previously.

This architecture was used to complete two experiments. The experiments had a twofold objective: 1) to determine the recognition accuracies of the individual systems using an "easy" and a "hard" vocabulary and 2) to determine whether the addition of the EMR algorithm would produce accuracy rates greater than the rates produced by the individual systems.

The first experiment relied on a simple, or easy, vocabulary (Table 1) that consisted of 20 words common to the commands used by pilots in a cockpit, spoken by six male and six female pilots who were not experienced with ASR systems.
Each pilot was randomly presented all 20 vocabulary words in 5 separate trials, resulting in 100 words spoken by each pilot. Before the start of the set of trials, which were run consecutively, each pilot trained the three individual systems to recognize his or her voice.

Table 1 Barry et al. (1994) Experiment 1 Vocabulary

To determine whether the resulting system produced better results than the individual systems, a measure of the mean Adjusted Overall Accuracy (AOA) was used along with a gender-by-recognition-system within-subjects factorial design. The AOA consisted of a count of the number of correctly recognized words, the number of words presented, and the number of word insertions, as shown in Figure 3.

Figure 3 Barry et al. (1994) Adjusted Overall Accuracy Measure

The resulting data analysis showed that the EMR algorithm produced statistically significantly better recognition accuracy than two of the three individual systems, but failed to do the same when compared to the third system. Additional analysis of the data showed that there was no effect of gender differences.

The second experiment used the same architecture, algorithm, and procedure, but differed in the robustness of the vocabulary that was used. The new vocabulary still reflected common words used in an aircraft cockpit environment, but was chosen based on the potential "confusability" of the words in the vocabulary. The 25 words, as shown in Table 2, were chosen based on how closely a word sounded like another word in the vocabulary. Ten of the 12 pilots from the first experiment participated in the second experiment, along with 2 new participants. The resulting recognition accuracy and data analysis were the same as in the first experiment.

Table 2 Barry et al. (1994) Experiment 2 Vocabulary

Overall, the use of the EMR algorithm with three individual recognition systems produced better recognition accuracy than the individual systems.
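The EMR resolution order described in this section (majority on first choices, then second choices added with equal weight, then distance scores) can be illustrated with a minimal Python sketch. It is not the implementation of Barry et al.; the names Hypothesis and emr_resolve are hypothetical, "majority" is assumed to mean a single word proposed by more than half of the systems, a lower distance score is assumed to indicate a closer match, and insertions are not modeled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One recognizer's result for a single spoken word (hypothetical structure)."""
    first: str              # first-choice word
    first_distance: float   # assumption: lower distance means a closer match
    second: str             # second-choice word
    second_distance: float


def unique_majority(words, n_systems):
    """Return the single word proposed more than n_systems / 2 times, else None."""
    top = Counter(words).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # two words are equally frequent, so there is no clear winner
    word, count = top[0]
    return word if count > n_systems / 2 else None


def emr_resolve(hyps):
    """Resolve disagreement between recognizers in the three steps the text describes."""
    n = len(hyps)
    firsts = [h.first for h in hyps]

    # Step 1: look for majority agreement on the first-choice words.
    winner = unique_majority(firsts, n)
    if winner is not None:
        return winner

    # Step 2: add the second-choice words with equal weight and look again.
    pooled = firsts + [h.second for h in hyps]
    winner = unique_majority(pooled, n)
    if winner is not None:
        return winner

    # Step 3: fall back to the distance scores (assumed: smallest distance wins).
    scored = [(h.first_distance, h.first) for h in hyps]
    scored += [(h.second_distance, h.second) for h in hyps]
    return min(scored)[1]


if __name__ == "__main__":
    results = [
        Hypothesis("gear", 0.10, "fear", 0.40),
        Hypothesis("gear", 0.20, "year", 0.50),
        Hypothesis("year", 0.30, "gear", 0.60),
    ]
    print(emr_resolve(results))
```

Because every recognizer in this scheme hears the same signal, the sketch also makes the limitation visible: if all three hypotheses are wrong in the same way, no step of the resolution can recover the spoken word.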
While an improvement was made, the architecture can suffer from distorted input. Since each system receives the same input, if the input signal is not good, then all of the individual systems will receive that bad input signal.

2.3 Virtual Intelligent Codriver

The Virtual Intelligent Codriver (VICO) project also used multiple ASR systems in parallel within automobiles to increase the accuracy of hands-free communication inside a car (Brutti et al. 2004, Cristoforetti et al. 2003). The specific aim of the VICO project was to assist drivers in accessing tourist information and driving assistance while inside an automobile.

The VICO structure had multiple ASR systems, with each receiving the same input and having its own specialized language model (see Figure 4). There were several distinct language models as an alternative approach to one comprehensive language model and large vocabulary. The input that each system received was first processed and included an optional background noise reduction procedure. The multiple systems then performed independent speech recognition on the input. The resulting interpretations from each ASR system were then passed to a module that selected the final output. This output was chosen by comparing the interpretations to each other using confidence scores. The interpretation with the maximum likelihood was selected.

Figure 4 Brutti et al. (2004) Distributed Listening Architecture

The final output was then passed to a module that performed natural language understanding, in that the module took the recognized interpretation and produced a semantically correct representation that could be passed to a dialogue manager module and ultimately a response generator module. To produce the semantically correct representation, the natural language module parsed the recognized interpretation and identified select elements within the phrase.
This module did not check for incomplete phrases, grammatically incorrect phrases, or other inconsistencies in the recognition. Rather, meaningful words of the phrase were identified and subsequently passed to the dialogue manager module. The dialogue manager module was then responsible for selecting responses to the spontaneous requests of the driver, based on the semantic information from the natural language understanding module. The selected response was then passed to the response generator with instructions on how to generate the response, which would in turn undergo speech synthesis before being presented to the driver. An example dialogue of the VICO system is shown in Figure 5.

Figure 5 Brutti et al. (2004) Sample Dialogue

A number of experiments were performed within an automobile using an architecture that consisted of 5 ASR systems. The spoken data came from 8 female and 8 male speakers, with a combined total of 1612 utterances that translated to 9150 word occurrences and 918 vocabulary words. The research team was able to perform synchronous experiments by using a "close-talk head-mounted" microphone, as well as a "far-microphone" that was located on the ceiling of the automobile. The experiments showed a noticeable improvement over a structure with individual ASR systems. Specifically, the close-talk microphone experiment resulted in a 3.7% increase in recognition rates compared to using individual ASR systems. Likewise, the far-microphone experiment resulted in a 2.4% increase in recognition rates.

The researchers found that while using the maximum likelihood to select the final output from the multiple ASR systems did show a decrease in error rates, that method represents the simplest choice. The project team recognized this fact and noted other ASR system reconciliation methods for future research, including confidence measures and word graph hypotheses.
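The maximum-likelihood selection step that the researchers call the simplest choice can be shown in a few lines. The following Python sketch is illustrative only and is not the VICO implementation; the Interpretation structure, the recognizer names, and the scores are hypothetical, and the scores are assumed to be directly comparable across recognizers.

```python
from typing import NamedTuple


class Interpretation(NamedTuple):
    """Output of one specialized recognizer (hypothetical structure)."""
    recognizer: str   # label for the recognizer or its language model
    text: str         # recognized phrase
    score: float      # likelihood or confidence; assumed comparable across recognizers


def select_output(interpretations):
    """Pick the single interpretation with the highest score."""
    if not interpretations:
        raise ValueError("at least one interpretation is required")
    return max(interpretations, key=lambda item: item.score)


if __name__ == "__main__":
    candidates = [
        Interpretation("navigation", "take me to the station", 0.62),
        Interpretation("tourism", "take me to the stadium", 0.71),
        Interpretation("commands", "take me to this nation", 0.35),
    ]
    print(select_output(candidates).text)
```

The selection keeps one recognizer's whole phrase and discards the rest, so correct words that appear only in a lower-scoring interpretation are lost.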
Although the VICO project did show an improvement in speech recognition accuracy when using multiple ASR systems with specialized recognition units, there are two shortcomings. First, if the input signal is distorted, then each recognizer will receive bad input. Second, if each recognizer contains only a piece of the optimal interpretation, then this architecture falls short, because it can only select one complete interpretation. To address this problem, a post-processing combination algorithm is required.

2.4 Recognized Output Voting Error Reduction

The Recognized Output Voting Error Reduction (ROVER) system is a composite of multiple ASR systems that uses a voting process to reconcile differences in the individual ASR system outputs (Fiscus 1997). The ROVER architecture, as shown in figure 6, consists of multiple recognition engines, an alignment module, and a voting module. The multiple interpretations from the recognition engines are passed to the alignment module, which iteratively builds a composite linear topology Word Transition Network (WTN). The ROVER system subjectively selects one of the ASR system outputs to act as the base of the WTN. The WTN depicts the order and transition from one word to another in a given ASR system output. Each ASR system output is added to the WTN until a composite network is built that shows the word similarity, by position, between the individual ASR system outputs.

Once the outputs are aligned, the voting module is called. The ROVER project investigated three different voting schemes. First, votes were tallied based on frequency of occurrence. Second, votes were tallied by frequency of occurrence along with the average word confidence score. Lastly, votes were tallied by frequency of occurrence along with the maximum word confidence score. No matter the voting scheme used, the voting module scores the competing words at each position of the composite WTN, and the word with the highest score at each position is chosen.
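The three voting schemes can be illustrated for a single, already aligned WTN position. The following is a minimal sketch, not the ROVER implementation: the function name vote_slot, the weight alpha that trades frequency against confidence, and the example words and confidence values are all assumptions introduced here for illustration.

```python
from collections import Counter

def vote_slot(candidates, scheme="frequency", alpha=0.5):
    """Score the competing words at one aligned WTN position.

    candidates: list of (word, confidence) pairs, one per recognizer.
    alpha: hypothetical weight between frequency and confidence.
    Returns (winners, best_score); more than one winner means a tie.
    """
    n_systems = len(candidates)
    counts = Counter(word for word, _ in candidates)
    scores = {}
    for word, count in counts.items():
        freq = count / n_systems
        confs = [c for w, c in candidates if w == word]
        if scheme == "frequency":
            scores[word] = freq
        elif scheme == "average":
            scores[word] = alpha * freq + (1 - alpha) * sum(confs) / len(confs)
        elif scheme == "maximum":
            scores[word] = alpha * freq + (1 - alpha) * max(confs)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    best = max(scores.values())
    winners = sorted(w for w, s in scores.items() if s == best)
    return winners, best

# Hypothetical slot: three recognizers disagree on one word.
slot = [("kids", 0.9), ("kits", 0.4), ("kid", 0.6)]
tied_winners, _ = vote_slot(slot, "frequency")
avg_winners, _ = vote_slot(slot, "average")
```

With three recognizers that all disagree, frequency alone leaves every word tied, whereas either confidence-based scheme separates them. This is one way to see why ties are tied to the frequency-only scheme.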
If there are ties between words, the ROVER system arbitrarily selects a winner. Tests were performed for each of the voting schemes, using 3 ASR systems for each experiment.

Figure 6 ROVER Architecture

The first voting scheme performed the poorest relative to the other voting schemes, but still achieved a decrease in the error rate compared to an individual system. This was the only voting scheme that resulted in ties within the WTN. Out of approximately 30,000 words, 5,320 resulted in a tie. The second voting scheme, which used average confidence scores, showed even more of an improvement over an individual system. The fact that this voting scheme did not have to resolve any word score ties was considered a significant improvement in itself. The third voting scheme showed the best improvement over an individual system, in addition to not having to resolve any word score ties.

On average, this composite ASR system produced a lower error rate than any of the individual systems, but it is sensitive to the order in which outputs are combined into the WTN and to ties within the voting module. To overcome these shortcomings, a future research direction of this project was to investigate other voting methods that take advantage of different knowledge sources, including decision trees and artificial neural networks. Additionally, an alternative method for aligning the results of the individual ASR systems into the WTN by way of phonologic mediation is of interest to this research team.

2.5 Modified ROVER

To solve the problems that resulted from the order of combination and the ties of the original ROVER system, Schwenk proposed a modified ROVER system that used a dynamic programming algorithm built on language models (Schwenk and Gauvain 2000). To accomplish the task, the modified ROVER system took advantage of the composite WTN of the original ROVER system, where the most likely word is chosen from each branch of the WTN, but changed the order of the systems when producing the composite WTN.
Moreover, the modified system added a normalization procedure before combining the systems. The normalization mapped alternate spellings and abbreviated forms of words to a common form. In the case where the WTN produced more than one result, the modified ROVER system used a language model to select the most likely word sequence based on contextual information. That is, when a position in the WTN resulted in a tie, all of the variations were kept, and the resulting word sequences were scored by their perplexity under the language model. The word sequence that minimized perplexity was chosen. To ensure that the dynamic programming algorithm did not automatically prefer short word sequences, a penalty was applied for null arcs of the WTN. An example WTN is shown in figure 7, where incorrect words are underlined.

Figure 7 Modified and Standard ROVER Example WTN

The modified ROVER system was tested on two different speech recognition corpora using a varying number of ASR systems, from 2 up to 9 combined systems. Table 3 displays the word error rate, sentence error rate, and perplexity of the modified ROVER system and compares them with the same rates for the original ROVER on the first corpus. When combining between 2 and 7 systems, an increase in accuracy is seen for the applicable metrics. Notice, however, that when combining 8 or 9 systems, the word error and sentence error rates are worse for the modified ROVER system, yet the perplexity is better.

Table 3 Modified ROVER Word Error Rates, Sentence Error Rates, and Perplexity

The second experiment showed similar results. Between 2 and 5 systems were combined on a second corpus to produce a decrease in word error rates over the original ROVER system, as shown in table 4. The relative improvement displayed in the table is relative to the best single recognizer, which achieved a 17.1% word error rate.
Table 4 Modified ROVER Word Error Rates

The experimental data analysis showed that using language model information is advantageous, but performance can degrade when too many systems are combined, especially those systems with the highest word error rates. Overall, the modified ROVER system resulted in a 5% relative word error reduction over the original ROVER system.

2.6 Post-Labeling Integration

Paul Duchnowski researched a method termed Post-Labeling Integration (figure 8) that used multiple recognizers, called sub-recognizers, that worked independently before being integrated to produce the final output decision (Duchnowski 1993).

Figure 8 Duchnowski Block Diagram of the Proposed Recognizer

To accomplish the task, Duchnowski used a single speech signal that was filtered into frequency bands and parameterized before being fed into 4 sub-recognizers. The parameterization was consistent across each independent channel and was used to extract "the most salient, information bearing features" of the speech signal. The individual outputs of the sub-recognizers were processed using a combination rule that merged the outputs to produce the recognized phones, the smallest identifiable elements of speech. To integrate the outputs of the sub-recognizers, the speech signals were aligned by identifying the timing between successive phones, and probabilistic functions and a bigram language model were used to select the final sequence of phones.

The subsequent experiments on this architecture were performed using the National Institute of Standards and Technology (NIST) version of the TIMIT database, which is a "readily available, large, multi-speaker database that has been phonetically transcribed" and contains 630 speakers, with twice as many males as females. Each of the 630 speakers supplied 10 sentences, and they collectively represented 8 regional dialects of American English.
Of the 6300 sentences available, a portion was not used because those sentences were the same for all of the speakers and would have introduced a bias towards certain phones. Table 5 shows the portions of the TIMIT database that were used for the experiments.

Table 5 Duchnowski (1993) Breakdown of TIMIT Database

The resulting data analysis showed that 54% of the phones were correctly recognized. Duchnowski stated that the resulting recognition rate was similar to that of comparable systems, but not a major improvement. It is worthwhile to mention that this research project is distinct from some others in this chapter in that the focus was on the phone level, as opposed to the word level.

2.7 Multiple Japanese LVCSR Models

This research project combined multiple Japanese LVCSR models and was also motivated by the ROVER project (Kodama et al. 2001). The research team hypothesized that if a simple voting scheme such as the one used with ROVER could produce word error reduction, then it is possible to further improve results by "simply exploiting more than one speech recognizers' output". To accomplish the task, the confidence of agreement between the outputs of the systems was evaluated, with the systems having different acoustic models.

Two different acoustic models were evaluated, the first being a phoneme-based HMM and the second being a syllable-based HMM, both based on Gaussian mixture HMM. The phoneme-based acoustic model was gender-dependent (male) and included 43 Japanese phonemes. The syllable-based HMM was also gender-dependent (male) and included 114 Japanese syllables. To evaluate the confidence of the combined Japanese LVCSR models, a metric that determined the recall and precision of estimating correctly recognized words was used, according to the formulas shown in figure 9. The formulas rely on an agreed word list, which is the collection of words that were aligned through dynamic programming and had identical lexical form.
Figure 9 Multiple Japanese LVCSR Models Recall and Precision Formulas

To train and test the system, two datasets were used. One dataset was composed of 100 newspaper sentence utterances spoken by 10 males and consisting of 1,565 words. This dataset was considered relatively easy for speech recognizers. The second, harder dataset was composed of 175 broadcast news speech utterances and consisted of 6,813 words spoken by 10 male speakers, of which 8 were announcers and the other 2 were reporters. Half of the data were used for training and the other half were used for testing.

The resulting data analysis showed that relying on the agreement between the outputs of multiple LVCSR models, along with different acoustic models, performs well. Most notable is that the composite system was found to have quite high precision with less than 10% loss of recall relative to a single LVCSR model. The precision is over 99% when the word recognition rate of a single LVCSR model is approximately 90%, and over 93% when the baseline word recognition rate is below 60%.

2.8 Segmental Minimum Bayes-Risk Recognition

Goel et al. (2000) were also motivated by the ROVER project and attributed the success of ROVER and subsequent voting schemes to the Minimum Bayes-Risk (MBR) framework. To further extend the research in this area, a segmental MBR recognition procedure was developed with a derivative of the voting procedure to create N-best ROVER. Another extension was made that produced the Extended (e)-ROVER.

N-best ROVER first constructed a WTN based on the N-best outputs of N systems. A posterior probability equation based on the WTN and a distribution was then used for each correspondence set of words. The word with the highest posterior probability from each set was selected and concatenated to produce the final output. The research team then improved segmentation, which resulted in e-ROVER.
This allowed two or more consecutive words in each correspondence set and was achieved by joining two consecutive sets. The joined, expanded set replaced the two sets while maintaining the paths from the original sets (figure 10) within the WTN. The addition of the segmentation was further derived from a "pinching" procedure that ignored those word sets with a posterior probability above the pinching threshold. An example e-ROVER WTN is shown in figure 11.

Figure 10 e-ROVER Joining Two Correspondence Sets

Figure 11 e-ROVER WTN Construction

The experiments were based on a multi-lingual acoustic modeling task that included combined Czech recognition outputs from three systems. The three individual systems had error rates of 29.42%, 35.24%, and 29.22%, which are used as baseline metrics for comparison of N-best ROVER and e-ROVER. The best individual system achieved an error rate of 29.22%, and N-best ROVER was able to improve that baseline by 3.28%. An additional improvement of 0.56% was achieved from e-ROVER to produce a 3.84% improvement over the baseline. A comparative analysis of N-best ROVER and e-ROVER is shown in figure 12.

Figure 12 N-Best ROVER and e-ROVER Comparative Analysis

Overall, the resulting tests provided a small, yet significant, improvement. The primary shortcoming of e-ROVER is the method used for pinching. The researchers believe that pinching by overall Bayes-Risk, and not just posterior probabilities, will result in further improvements.

2.9 Posterior Probability Decoding with Confidence Estimation

The ROVER project motivated another research team that evaluated posterior probabilities with confidence score estimation using multiple recognizers (Evermann and Woodland 2000). This project relied on confusion networks that provided a representation of the most likely word along with the associated word posterior probability. The posterior probabilities were used to create the confidence scores.
The confidence scores based on posterior probabilities, together with the confusion networks, were the key features of the system. The word-level posterior probabilities were "derived from the acoustic and language model (LM) likelihoods of the word sequences hypothesised by a Viterbi decoder" and represent the competing words and scores. The confusion network is a linear graph composed of a word lattice that has been clustered and transformed using a Viterbi decoder. The resulting confusion network was used with the dynamic programming alignment procedure of the original ROVER project and contains word posteriors (figure 13).

Figure 13 Posterior Probability Example Confusion Network

To test the composite system, the CU-HTK system that was used in the March 2000 Hub5 Conversational Telephone Speech evaluation was applied, and the acoustic models were trained using the Switchboard and CallHome corpora. The two training criteria used were the maximum likelihood estimation criterion and the maximum mutual information estimation criterion. The resulting data analysis showed that the use of confidence scores in this manner produced a decrease in error rates compared to the original ROVER system that used a simple voting scheme, and the addition of the confusion network gave another small increase in accuracy rates.

2.10 Adaptive Language Models

Solsona et al. (2002) investigated multiple recognizers within a travel reservation system using state-dependent Finite State Grammars (FSG) and context-independent n-grams. The researchers used two recognizers working in parallel, with one recognizer based on n-gram statistical language models and the other on finite state grammars. Once the results from each recognizer were combined, an acoustic confidence measure was used to reconcile them to one result. The confidence measure was a phone-based likelihood ratio. To select the final result, the sentence with the highest phone score was chosen.
The experiment was based on a trigram model and used a training corpus that consisted of 19,283 sentences, or 58,595 words, from the June 2000 DARPA Communicator data collection and database. The results from the experiment are shown in table 6. The last column in the table gives the average of the results for states 2 through 7, although states 5 through 7 are not listed in the table. This is because states 2 through 4 contained the largest number of training sentences and were the focus. The averages of the other states were given for completeness.

Table 6 Adaptive Language Models Word Error Rates

The 3-gram language model produced an average word error rate of 39.9% (for states 2-4), which was considered poor performance. The researchers attributed the poor performance to sparse training data. To overcome the sparseness, 15 semantic classes were introduced (class 3-gram) and the error rate was reduced to 21.1%. This shows that a class-based language model is effective for generalizing the language model. A further reduction in error rates was achieved by interpolating the state-independent class 3-gram with a state-dependent trigram (adapted 3-gram). The interpolation produced an average of 20.6%, which is a relative reduction of 2.5% over the class 3-gram model. Lastly, the results from combining the general class trigram model and the state-dependent FSG (adapted FSG) showed a decrease from 21.1% to 18.5%. Overall, combining results from state-dependent FSGs and context-independent n-gram language models routinely outperforms the baseline and can achieve up to 12% relative reduction in error rates.

While previous research activities using multiple ASR systems have resulted in improvements, they consistently used the same input source or relied on the alignment of recognition results to achieve an optimal result.
The research presented here uses neither of those criteria in resolving the results from the multiple ASR systems, as discussed in the next chapter.

3 System Design

Speech recognition is capable of 95% accuracy under optimal conditions, but optimal conditions are not always possible, which is why speech technology is not in mainstream use. Distributed Listening attempts to improve the recognition accuracy achieved, independent of the environmental conditions.

3.1 Problem Statement

Distributed Listening uses multiple perspectives collected from distributed speech recognizers. Distributed Listening also uses a phrase resolution algorithm to reconcile the results of each recognizer. This approach addresses the following issues found with current speech recognition systems:

1. Less than favorable recognition results in sub-optimal environments. This includes environments with considerable background noise and environments where the system has not been trained for a specific individual.
2. Poor recognition accuracy due to distorted input. Current systems use the same input source for each recognizer, so poor input will result in undesirable recognition rates.

Distributed Listening was developed in response to the aforementioned problems, with the hypothesis that Distributed Listening will perform, at worst, as well as the best individual recognizer. The remainder of this chapter describes the physical aspects of the system, both hardware and software, and how those aspects work together as a viable solution to the issues that have been presented.

3.2 Design Principles

The purpose of Distributed Listening is to provide a more accurate speech recognition system. The main features of the system include:

- Multiple, yet distinct, input sources
- An accurate reconciler for the multiple recognition results

These necessary features dictate the architecture of the Distributed Listening system, as described in the following section.
3.3 System Features

Distributed Listening is composed of two significant parts. The first is the listeners and the second is the interpreters, which rely on the Evidence-Based Phrase Resolution Algorithm (PRA). In addition, Distributed Listening utilizes a speech corpus and a database. Each part is equally important and is described in detail next.

3.3.1 Listeners

Distributed Listening uses multiple speech recognizers to process the spoken input. Each recognizer is called a listener and is equipped with its own input source. Each listener is a separate, physical computing device with its own memory, processor, and disk space, and works independently of the other listeners. Each listener collects input in the form of speech and performs interpretation that is ultimately used by the master interpreter to produce one result that serves as the most likely spoken phrase.

3.3.2 Interpreters

Once input is collected by the listeners, each listener performs interpretation on the input using a resolution algorithm to produce a final set of candidate phrases. Each set of candidate phrases is then processed by the master interpreter to reconcile the variations in the results of the listeners. As a separate entity, the interpretation is not very powerful. Therefore, the interpretation works together with, and relies on, the resolution algorithm. The algorithm is dependent on one additional system feature: a corpus. The corpus is defined next, followed by a detailed description of the resolution algorithm.

3.3.2.1 Corpus

The corpus is a validation entity and is specific to the domain that utilizes Distributed Listening. It can be increased or decreased as necessary, based on the characteristics of the domain. In essence, the corpus is a database table of known utterances relative to a particular area. For this system, the utterances are recorded in bigram form, where a bigram is a 2-word pair as illustrated in figure 14.
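Breaking an utterance into bigrams and tallying them into a corpus table can be sketched as follows. This is a minimal illustration under stated assumptions: the function names to_bigrams and build_corpus are hypothetical, the table is held in memory rather than in a database, and the second utterance is invented for the example.

```python
from collections import Counter

def to_bigrams(utterance):
    """Split an utterance into its ordered list of 2-word pairs."""
    words = utterance.lower().split()
    return list(zip(words, words[1:]))

def build_corpus(utterances):
    """Count each bigram across utterances: (word1, word2) -> frequency."""
    table = Counter()
    for utterance in utterances:
        table.update(to_bigrams(utterance))
    return table

bigrams = to_bigrams("today is a good day")
# bigrams: today-is, is-a, a-good, good-day

# Hypothetical two-utterance domain, used only to show the frequency column.
corpus = build_corpus(["today is a good day", "today is a long day"])
```

A sentence of n words yields n - 1 bigrams, so the five-word example sentence produces four pairs.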
The corpus maintains a unique list of bigrams that people commonly speak, as well as the number of times each bigram is spoken (frequency), relative to the other bigrams in the corpus. This list is maintained as a database table, as defined in figure 15. The bigrams, or 2-word pairs, of the corpus resemble a bigram approach to language modeling and are used with the resolution algorithm to produce the most likely spoken phrase. The actual corpus used in the experimentation phase will be discussed in detail in chapter 4.

Figure 14 Example Bigrams
Example sentence: today is a good day
Final bigrams: today-is, is-a, a-good, good-day

Figure 15 Bigrams ER Diagram
bigrams ( word1, word2, frequency )

The corpus works in combination with the PRA. The algorithm, and the supporting role of the corpus, will be described next.

3.3.2.2 Evidence-Based Phrase Resolution Algorithm (PRA)

To resolve multiple recognitions from the listeners, the Evidence-Based Phrase Resolution Algorithm (PRA) is used. Recall from chapter 1 that Distributed Listening is analogous to a crime scene investigation, where witnesses correspond to listeners. The stories of the witnesses are the beginning of the evidence collection. Likewise, the algorithm uses as evidence the actual bigrams from the recognitions of the listeners. Each listener's recognition result is broken down into its individual bigrams, and the collection of original bigrams from all of the listeners is retained as evidence. Next, the recognitions go through an iterative process where new bigrams are created by combining a word from one recognition with a word from a separate recognition, based on the position of the word within the recognition. The recognitions are not aligned.
Rather, the first word from each listener is combined with the second word from each listener, followed by combining the second word from each listener with the third word from each listener, and so on until the end of each individual recognition phrase. This procedure is best depicted as a 2-dimensional word matrix, with the resulting iterations shown as nested loops (figure 16). For recognitions that are shorter than others, when the end of that recognition is reached, it is no longer used to create bigrams.

As a new bigram is created, it is validated against a corpus and the evidence. A bigram is considered valid if it is found within either the evidence or the corpus; otherwise the bigram is considered invalid. If the newly created bigram is deemed valid, it is added to a temporary table that maintains a local count of the frequency of that particular bigram. The valid bigrams that are created are concatenated together to form candidate strings that represent the most likely spoken phrase. It is important to note that the initial bigrams from the recognitions of the listeners are not validated against the corpus. This is so as not to delete the "evidence" presented by the listeners. Should the recognitions from the listeners contain a bigram that is not in the corpus, that bigram should not be discarded, as the recognitions from the listeners are the base, or the evidence, of the candidate phrases.

Once the iterative process is complete and the candidate strings have been created, each listener computes the local total frequency count of its candidate phrases. Specifically, for each bigram within a candidate string, a running total of the frequency of those bigrams is calculated. The candidate phrase or set of candidate phrases with the highest total is sent to the master interpreter. This iterative process happens in parallel with each listener. Next, the master interpreter receives the full set of candidate phrases from the listeners.
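The scoring step described above, in which a listener totals the local bigram frequencies of each candidate phrase and forwards the highest-scoring phrase or phrases, can be sketched as follows. The function names, the candidate phrases, and the local counts are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def phrase_score(phrase, local_counts):
    """Running total of the local frequency of every bigram in the phrase."""
    words = phrase.split()
    return sum(local_counts.get(pair, 0) for pair in zip(words, words[1:]))

def best_candidates(candidates, local_counts):
    """Return the candidate phrase(s) with the highest total, plus that total."""
    scored = {p: phrase_score(p, local_counts) for p in candidates}
    top = max(scored.values())
    return [p for p in candidates if scored[p] == top], top

# Hypothetical local frequency table built during bigram creation.
local_counts = {
    ("today", "is"): 4, ("is", "a"): 4, ("a", "good"): 4,
    ("good", "day"): 2, ("good", "bay"): 1,
}
candidates = ["today is a good day", "today is a good bay"]
winners, total = best_candidates(candidates, local_counts)
```

Returning a list rather than a single phrase keeps tied candidates together, matching the statement that a set of candidate phrases may be sent to the master interpreter.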
The phrase within the set of phrases with the greatest frequency sum is chosen as the most likely spoken phrase. If there is a tie between candidate phrases, the phrase that is equal to the recognition result of a listener is selected. This is due to the likelihood that one of the distributed listeners heard the actual spoken phrase correctly. This selection process favors results that have evidence from the listeners, as opposed to results that were created through concatenations. If there is still a tie between candidate phrases, those candidate phrases are validated against the corpus. If a phrase contains bigrams that are not found within the corpus, it is discarded and the remaining valid phrase is chosen as the most likely spoken phrase. If by chance there is still a tie, the total bigram frequency count for each phrase is re-calculated according to the corpus, instead of the local bigram frequency total, and the phrase with the greatest frequency sum is chosen as the most likely spoken phrase. Any additional ties are broken by arbitrarily choosing a phrase as the most likely spoken phrase.

Figure 16 Evidence-Based Phrase Resolution Bigram Creation

Word Matrix
           Position 1   Position 2   ...   Position n
Phrase 1   word[1][1]   word[1][2]   ...   word[1][n]
Phrase 2   word[2][1]   word[2][2]   ...   word[2][n]
Phrase m   word[m][1]   word[m][2]   ...   word[m][n]

Bigram Creation
    For ( i = 1 to n-1 )
        For ( j = 1 to m )
            For ( k = 1 to m )
                Bigram = word[j][i] + word[k][i+1]

An example of the PRA will help put this into perspective. Assume there are three listeners and that the actual spoken phrase of the person is:

- These kids don't deserve to be educated they say

The phrases as "heard" by the three listeners are as follows:

Listener 1: These kids don't deserve to be educated basic
Listener 2: These kids don't deserve to be educated they say
Listener 3: These kids don't deserve to be educated these to

Also assume there is a corpus called DL Utterances that contains 400,000 unique bigrams with corresponding frequency counts and is composed of various newspaper text articles. The portion of the DL Utterances corpus that contains the bigrams from this example is shown in table 7.

Word 1     Word 2     Frequency Count
be         educated   5
deserve    to         18
don't      deserve    5
educated   they       2
kids       don't      21
these      to         4
these      kids       72
they       say        432
they       to         3
to         be         5490

Table 7 PRA - DL Utterances Corpus

The first step of the algorithm is to store the evidence from the recognitions of all three listeners, resulting in an evidence bag that contains 11 unique bigrams (table 8).

Word 1     Word 2
these      kids
kids       don't
don't      deserve
deserve    to
to         be
be         educated
educated   basic
educated   they
they       say
educated   these
these      to

Table 8 PRA Evidence Bigrams

Next, the local bigram frequency count is calculated, beginning with the combination of the first and second words of each listener. As bigrams are created and validated against either the evidence (table 8) or the corpus (table 7), they are put into a temporary table that maintains the unique list of bigrams and their cumulative total frequency count. In keeping with the PRA example, after the first round of iterations for each recognition result, the temporary table has the bigram "these-kids" with a frequency count of 9, as shown in figure 17. Within the figure, an empty word position is marked explicitly. The next round of iterations involves the second and third words from each listener (figure 18).
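The evidence collection and the local frequency count for this example can be reproduced with a short sketch. It follows the nested-loop bigram creation of figure 16 and the validity rule stated earlier (a created bigram is kept if it appears in the evidence or in the corpus). The function and variable names are assumptions made here for illustration, and the corpus is limited to the excerpt in table 7.

```python
from collections import Counter

RECOGNITIONS = [
    "these kids don't deserve to be educated basic",
    "these kids don't deserve to be educated they say",
    "these kids don't deserve to be educated these to",
]

# Excerpt of the DL Utterances corpus from table 7.
CORPUS = {
    ("be", "educated"): 5, ("deserve", "to"): 18, ("don't", "deserve"): 5,
    ("educated", "they"): 2, ("kids", "don't"): 21, ("these", "to"): 4,
    ("these", "kids"): 72, ("they", "say"): 432, ("they", "to"): 3,
    ("to", "be"): 5490,
}

def collect_evidence(phrases):
    """Unique bigrams taken directly from each listener's own recognition."""
    evidence = set()
    for words in phrases:
        evidence.update(zip(words, words[1:]))
    return evidence

def local_frequency(phrases, evidence, corpus):
    """Pair word i of every listener with word i+1 of every listener and
    count the bigrams that are valid (found in the evidence or the corpus)."""
    counts = Counter()
    longest = max(len(words) for words in phrases)
    for i in range(longest - 1):
        for first in phrases:
            if i >= len(first):        # this recognition has already ended
                continue
            for second in phrases:
                if i + 1 >= len(second):
                    continue
                bigram = (first[i], second[i + 1])
                if bigram in evidence or bigram in corpus:
                    counts[bigram] += 1
    return counts

phrases = [r.split() for r in RECOGNITIONS]
evidence = collect_evidence(phrases)
counts = local_frequency(phrases, evidence, CORPUS)
print(len(evidence), counts[("these", "kids")])  # 11 9
```

The count of 9 for these-kids arises because each of the three first words pairs with each of the three second words, and all nine pairings form the same valid bigram.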
Figure 17 PRA Round 1 Valid Bigram Creation and Local Frequency Count (positions 1 and 2; valid bigram: these-kids, frequency 9)

Figure 18 PRA Round 2 Valid Bigram Creation and Local Frequency Count (positions 2 and 3; valid bigram: kids-don't, frequency 9)

The next two rounds of iterations and valid bigram creation are shown in figure 19.

Figure 19 PRA Rounds 3 and 4 Valid Bigram Creation and Local Frequency Count (positions 3 and 4; valid bigram: don't-deserve, frequency 9; positions 4 and 5; valid bigram: deserve-to, frequency 9)

Likewise, the bigrams for rounds 5 and 6 are valid, as illustrated in figure 20.

Figure 20 PRA Rounds 5 and 6 Valid Bigram Creation and Local Frequency Count (positions 5 and 6; valid bigram: to-be, frequency 9; positions 6 and 7; valid bigram: be-educated, frequency 9)

The final two rounds and resulting valid bigrams are shown in figure 21.

Figure 21 PRA Rounds 7 and 8 Bigram Creation and Local Frequency Count (positions 7 and 8; valid bigrams: educated-basic 3, educated-they 3, educated-these 3; positions 8 and 9; valid bigrams: they-say 1, they-to 1, these-to 1)

Table 9 lists the resulting bigrams and local frequency counts from this process. Recall that only those bigrams that were validated against either the evidence or the corpus were counted, and therefore the local bigram and frequency count includes the evidence bigrams.

Word 1     Word 2     Frequency Count
be         educated   9
deserve    to         9
don't      deserve    9
educated   these      3
educated   basic      3
educated   they       3
kids       don't      9
these      kids       9
these      to         1
they       say        1
they       to         1
to         be         9

Table 9 PRA Local Bigrams and Corresponding Frequency Counts

Now that the evidence has been collected and the local bigram frequency count has been calculated, the iterative process begins in parallel with each listener. The remainder of this example will focus on listener 1. The purpose of this iterative process is to build unique candidate strings that will ultimately be passed to the master interpreter to represent the most likely spoken phrase. The beginning of the candidate string(s) consists of the first word of listener 1, concatenated with the second word of each listener. This concatenation, as shown in figure 22, results in one unique candidate string: these kids.

Figure 22 Listener 1-Round 1 PRA Iteration (word 1 of listener 1 with word 2 of each listener)

Further concatenations to build up the candidate strings follow a simple matching procedure.
If the last word of the current candidate string is equal to the first word of a bigram in the current round, a concatenation can be made. In keeping with the example, there is currently one candidate string: "these kids". The next round of valid and unique bigrams for listener 1 consists of "kids-don't", as shown in figure 23. To build up the candidate string, the algorithm takes the last word of the current candidate string (i.e., kids) and compares it to the first word of the bigram in the current round (i.e., kids). Since the two words are equal, a concatenation can be made (figure 24). The resulting candidate phrase is now: "these kids don't".

Word Matrix  Position1  Position2  Position3  ...  Position9
Phrase1      these      kids       don't      ...
Phrase2      these      kids       don't      ...  say
Phrase3      these      kids       don't      ...  to

Figure 23 Listener 1-Round 2 PRA Iteration

Candidate Strings:      these kids
Current Bigrams:        kids-don't (valid bigram)
New Candidate Strings:  these kids don't

Figure 24 Listener 1-Round 2 Candidate Phrase Concatenations

The next round of iterations involves word 3 of listener 1 and word 4 of each listener (figure 25). The new unique bigram is: don't-deserve. The newly created bigram is valid, according to table 9, and is subsequently used to continue building the candidate strings, as shown in figure 26.

Word Matrix  Position1  ...  Position3  Position4  ...  Position9
Phrase1      these      ...  don't      deserve    ...
Phrase2      these      ...  don't      deserve    ...  say
Phrase3      these      ...  don't      deserve    ...  to

Figure 25 Listener 1-Round 3 PRA Iterations

Candidate Strings:      these kids don't
Current Bigrams:        don't-deserve (valid bigram)
New Candidate Strings:  these kids don't deserve

Figure 26 Listener 1-Round 3 PRA Concatenations

The next round of iterations proceeds in the same manner, as illustrated in figure 27. Likewise, rounds 5 and 6 complete iterations as shown in figure 28.

Word Matrix  Position1  ...  Position4  Position5  ...  Position9
Phrase1      these      ...  deserve    to         ...
Phrase2      these      ...  deserve    to         ...  say
Phrase3      these      ...  deserve    to         ...  to

Candidate Strings:      these kids don't deserve
Current Bigrams:        deserve-to (valid bigram)
New Candidate Strings:  these kids don't deserve to

Figure 27 Listener 1-Round 4 PRA Iteration and Concatenation

Word Matrix  Position1  ...  Position5  Position6  ...  Position9
Phrase1      these      ...  to         be         ...
Phrase2      these      ...  to         be         ...  say
Phrase3      these      ...  to         be         ...  to

Candidate Strings:      these kids don't deserve to
Current Bigrams:        to-be (valid bigram)
New Candidate Strings:  these kids don't deserve to be

Word Matrix  Position1  ...  Position6  Position7  ...  Position9
Phrase1      these      ...  be         educated   ...
Phrase2      these      ...  be         educated   ...  say
Phrase3      these      ...  be         educated   ...  to

Candidate Strings:      these kids don't deserve to be
Current Bigrams:        be-educated (valid bigram)
New Candidate Strings:  these kids don't deserve to be educated

Figure 28 Listener 1-Rounds 5 and 6 PRA Iterations

Because the first 7 words of all three listeners are exactly the same, the iterative process produces only one candidate string until the 8th word is reached. Up to this point, the one candidate string is: "these kids don't deserve to be educated". The next round of iterations involves word 7 of listener 1, along with word 8 of each listener, as displayed in figure 29. As displayed in figure 30, this round of iterations produces three candidate strings:

1. these kids don't deserve to be educated basic
2. these kids don't deserve to be educated they
3. these kids don't deserve to be educated these

Word Matrix  Position1  ...  Position7  Position8  Position9
Phrase1      these      ...  educated   basic
Phrase2      these      ...  educated   they       say
Phrase3      these      ...  educated   these      to

Figure 29 Listener 1-Round 7 PRA Iteration

The next, and final, round of iterations for listener 1 produces two unique bigrams: basic-say and basic-to (figure 31). According to table 9, neither of those two bigrams is considered valid, and therefore neither will be used as a concatenation to further build the candidate strings (figure 32).
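The matching procedure walked through above for listener 1 can be summarized in a minimal Python sketch. It is an illustration under stated assumptions, not the original implementation: the function name build_candidates and the data layout are hypothetical, the set of valid bigrams is simply copied from table 9 rather than validated against the evidence or the corpus, and a candidate that cannot be extended in a round is assumed to be carried forward unchanged.

```python
from typing import List, Set, Tuple

Bigram = Tuple[str, str]

# Recognition results of the three listeners in the running example.
PHRASES: List[List[str]] = [
    "these kids don't deserve to be educated basic".split(),
    "these kids don't deserve to be educated they say".split(),
    "these kids don't deserve to be educated these to".split(),
]

# Assumption: the valid bigrams are exactly the entries of table 9.
VALID_BIGRAMS: Set[Bigram] = {
    ("these", "kids"), ("kids", "don't"), ("don't", "deserve"),
    ("deserve", "to"), ("to", "be"), ("be", "educated"),
    ("educated", "basic"), ("educated", "they"), ("educated", "these"),
    ("they", "say"), ("they", "to"), ("these", "to"),
}


def build_candidates(listener: int,
                     phrases: List[List[str]],
                     valid: Set[Bigram]) -> List[str]:
    """Build the candidate strings for one listener (hypothetical helper)."""
    own = phrases[listener]
    candidates = [[own[0]]]  # every candidate starts with the listener's first word
    for pos in range(len(own)):
        # Pair this listener's word at `pos` with word `pos + 1` of every listener,
        # keeping only unique bigrams that are considered valid.
        round_bigrams: List[Bigram] = []
        for other in phrases:
            if pos + 1 < len(other):
                bigram = (own[pos], other[pos + 1])
                if bigram in valid and bigram not in round_bigrams:
                    round_bigrams.append(bigram)
        # Concatenate when a candidate's last word equals a bigram's first word.
        extended = []
        for cand in candidates:
            matches = [b for b in round_bigrams if b[0] == cand[-1]]
            if matches:
                extended.extend(cand + [b[1]] for b in matches)
            else:
                extended.append(cand)  # nothing to append this round
        candidates = extended
    return [" ".join(c) for c in candidates]


listener_1 = build_candidates(0, PHRASES, VALID_BIGRAMS)
listener_2 = build_candidates(1, PHRASES, VALID_BIGRAMS)
listener_3 = build_candidates(2, PHRASES, VALID_BIGRAMS)
```

Under these assumptions, the sketch yields three candidate strings for listener 1, four for listener 2, and three for listener 3, matching the counts reported for this example.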
Candidate Strings:      these kids don't deserve to be educated
Current Bigrams:        educated-basic, educated-they, educated-these (valid bigrams)
New Candidate Strings:  these kids don't deserve to be educated basic
                        these kids don't deserve to be educated they
                        these kids don't deserve to be educated these

Figure 30 Listener 1-Round 7 PRA Concatenations

Word Matrix  Position1  ...  Position7  Position8  Position9
Phrase1      these      ...  educated   basic
Phrase2      these      ...  educated   they       say
Phrase3      these      ...  educated   these      to

Figure 31 Listener 1-Round 8 PRA Iteration

Candidate Strings:        these kids don't deserve to be educated basic
                          these kids don't deserve to be educated they
                          these kids don't deserve to be educated these
Current Bigrams:          basic-say, basic-to (invalid bigrams)
Final Candidate Strings:  these kids don't deserve to be educated basic
                          these kids don't deserve to be educated they
                          these kids don't deserve to be educated these

Figure 32 Listener 1-Round 8 PRA Concatenations

The final candidate strings produced by listener 1 are as follows:

1. these kids don't deserve to be educated basic
2. these kids don't deserve to be educated they
3. these kids don't deserve to be educated these

The next step of the algorithm is to compute the local bigram frequency total of the candidate strings that have been created, according to table 9, and send the candidate string(s) with the highest cumulative frequency count to the master interpreter. The candidate strings produced by listener 1 all have a total bigram frequency count of 57, as shown in table 10.

Local Bigram Frequency  Candidate Phrase
57                      these kids don't deserve to be educated basic
57                      these kids don't deserve to be educated they
57                      these kids don't deserve to be educated these

Table 10 Listener 1 Set of Candidate Phrases and Frequency Counts

Listeners 2 and 3 proceed in parallel in the same manner as described for listener 1.
At the end of this iterative process, once the bigram creation and concatenation procedure has been exhausted, listener 2 produces four candidate strings and corresponding frequency counts, as shown in table 11.

Bigram Frequency  Candidate Phrase
58                these kids don't deserve to be educated they say
58                these kids don't deserve to be educated they to
57                these kids don't deserve to be educated basic
57                these kids don't deserve to be educated these

Table 11 Listener 2 Set of Candidate Phrases and Frequency Counts

At the completion of the parallel processing for listener 3, there are three candidate strings. Those candidate strings and corresponding local frequency counts are shown in table 12.

Bigram Frequency  Candidate Phrase
58                these kids don't deserve to be educated these to
57                these kids don't deserve to be educated basic
57                these kids don't deserve to be educated they

Table 12 Listener 3 Set of Candidate Phrases and Frequency Counts

Each listener sends only the phrase with the highest frequency sum to the master interpreter. If a listener produced more than one phrase tied at that highest frequency total, all of the tied phrases are sent to the master interpreter. The final set of candidate phrases and frequency counts that will be sent to the master interpreter is listed in table 13.

Bigram Frequency  Candidate Phrase                                   Listener Number
57                these kids don't deserve to be educated basic      1
57                these kids don't deserve to be educated they       1
57                these kids don't deserve to be educated these      1
58                these kids don't deserve to be educated they say   2
58                these kids don't deserve to be educated they to    2
58                these kids don't deserve to be educated these to   3

Table 13 Master Interpreter Set of PRA Candidate Phrases

The final set of candidate strings consists of 6 phrases. It is now up to the master interpreter to select the phrase that will represent the most likely spoken phrase.
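The local frequency counts and the per-listener totals of 57 and 58 can be reproduced with a short Python sketch. This is a hedged illustration rather than the original implementation. The helper names are hypothetical. The counting rule is inferred from the example, on the assumption that each listener's word at one position is paired with every listener's word at the next position. The valid bigram set is again taken from table 9.

```python
from collections import Counter
from typing import List, Set, Tuple

Bigram = Tuple[str, str]

PHRASES: List[List[str]] = [
    "these kids don't deserve to be educated basic".split(),
    "these kids don't deserve to be educated they say".split(),
    "these kids don't deserve to be educated these to".split(),
]

# Assumption: the valid bigrams are exactly the entries of table 9.
VALID_BIGRAMS: Set[Bigram] = {
    ("these", "kids"), ("kids", "don't"), ("don't", "deserve"),
    ("deserve", "to"), ("to", "be"), ("be", "educated"),
    ("educated", "basic"), ("educated", "they"), ("educated", "these"),
    ("they", "say"), ("they", "to"), ("these", "to"),
}


def local_bigram_counts(phrases: List[List[str]], valid: Set[Bigram]) -> Counter:
    """Count valid bigrams formed across listeners at adjacent positions."""
    counts: Counter = Counter()
    longest = max(len(p) for p in phrases)
    for pos in range(longest - 1):
        for left in phrases:
            for right in phrases:
                if pos < len(left) and pos + 1 < len(right):
                    bigram = (left[pos], right[pos + 1])
                    if bigram in valid:
                        counts[bigram] += 1
    return counts


def score(phrase: str, counts: Counter) -> int:
    """Sum the local frequency of every adjacent word pair in the phrase."""
    words = phrase.split()
    return sum(counts[(a, b)] for a, b in zip(words, words[1:]))


def top_scoring(candidates: List[str], counts: Counter) -> List[str]:
    """Return every candidate tied for the highest frequency total."""
    best = max(score(c, counts) for c in candidates)
    return [c for c in candidates if score(c, counts) == best]


COUNTS = local_bigram_counts(PHRASES, VALID_BIGRAMS)

LISTENER_2_CANDIDATES = [
    "these kids don't deserve to be educated they say",
    "these kids don't deserve to be educated they to",
    "these kids don't deserve to be educated basic",
    "these kids don't deserve to be educated these",
]
sent_by_listener_2 = top_scoring(LISTENER_2_CANDIDATES, COUNTS)
```

With the three example phrases, the counts agree with table 9, and only the two listener 2 candidates totaling 58 are selected for the master interpreter.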
Three of the six phrases that comprise the final set of candidate strings have a bigram frequency count of 57, which is lower than the count of 58 for the other three phrases, and are discarded by the master interpreter. The remaining three phrases all have the same local bigram frequency count, and the master interpreter must break the tie between the three candidate strings in order to select the string that will serve as the most likely spoken phrase. To accomplish this, the master interpreter checks for equality between the candidate strings and the recognition results from the listeners. Two of the three candidate phrases are the same as the results from the listeners and make up the new set of candidate phrases:

1. these kids don't deserve to be educated they say
2. these kids don't deserve to be educated these to

The next step of the algorithm is to validate the two candidate phrases against the corpus. Since all of the bigrams within the candidate phrases are maintained in the corpus (table 7), neither candidate phrase is discarded. Finally, the sum of the frequencies of the bigrams is re-calculated for each of the two candidate strings according to the corpus (table 7). The final two phrases and their corresponding frequency counts are listed in table 14, with the winner shown in bold. The higher of the two sums is 6,045 and corresponds to the following phrase: these kids don't deserve to be educated they say. This phrase is returned by Distributed Listening as the most likely spoken phrase, and it matches the actual spoken phrase of the user exactly (100% correct).

Bigram Frequency  Candidate Phrase
6,045             these kids don't deserve to be educated they say
5,615             these kids don't deserve to be educated these to

Table 14 PRA Most Likely Spoken Phrase

The full system design of Distributed Listening comprises listeners, interpreters, and a reconciliation algorithm that are all supported by a corpus and a temporary storage medium.
The theoretical design and preliminary results suggest that this system design was practical and possible, yet a full experiment was needed to ultimately rate the effectiveness of the system design. A number of experiments were run for that purpose and the experimental setup will be discussed in the next chapter.

4 Experiment Design

The goal of the research experiments was to evaluate Distributed Listening with regard to recognition accuracy rates when combining recognitions of individual listeners through a distributed approach. Recall that the hypothesis is that Distributed Listening will, at worst, perform as well as the best individual listener. The experiments were necessary to support and validate the premise of Distributed Listening and will be discussed in the remainder of this chapter.

4.1 Institutional Review Board

The first aspect of the experiment that had to be addressed was the approval from the Institutional Review Board (IRB). The IRB is an institutional entity that protects the rights of human research participants. Since Distributed Listening relies on spoken input from humans, IRB approval was necessary before the experiment could begin. The spoken inputs used in the experiment, as described in section 4.2, were collected from the internet and were readily available and open to the public; therefore, the final IRB approval designation was exempt under 45 CFR 46.110(b)(4). This IRB request was very straightforward, as there was no need to find users with specialized knowledge, there was no Graphical User Interface that required a user satisfaction evaluation, and there was no need to evaluate the resulting system with regard to usability. Details of the IRB approval can be found in appendix D.

4.2 Data

The data needed for the experiment were in the form of spoken utterances and were taken at random from an online broadcast of National Public Radio.
In lieu of having a variety of users read the same set of phrases to the listeners, 56 random phrases were selected from the broadcast and used instead (see table 15). Each phrase was saved as an individual MP3 file. The average length of the utterances was 13 words, with the shortest phrase containing 4 words and the longest phrase containing 25 words. Using the recordings in this manner ensured consistency, eliminated the chance of having filler words, such as "ah" and "um", embedded into an utterance, and kept the speaking rate uniform.

In addition, the 56 recordings underwent a gain procedure using MP3Gain (MP3Gain 2010). MP3Gain is a software application that adjusts audio recordings so that a set of files has the same average loudness level without sacrificing quality or re-encoding. This ensures that when playback of the audio switches from one file to the next, the volume level remains constant. An inherent problem with this type of normalization is clipping, where certain files are clipped so that they do not exceed the maximum allowable decibel level. The clipping creates a rough, scratchy sound during loud parts of the audio recording. Most MP3 files will not have clipping at 89.0 dB, which was the setting used to normalize the 56 files.
Word Length  Seconds  Transcription
4            1.797    your handwriting has changed
4            2.199    it's become increasingly physical
7            2.947    how long can he maintain his equilibrium
7            2.939    women's rights advocates are cautiously hopeful now
8            2.445    it's a disease that affects how you move
8            2.328    an arm doesn't swing quite the same way
8            2.950    there are medications that make a real difference
8            3.355    spend or cut taxes that is the question
8            2.276    but are there any bright spots out there
8            2.888    the temporary job market may provide an answer
8            2.270    the authorities have been slow to address it
9            3.271    a lot has happened before you first notice it
9            2.458    and you'd stop the disease before it even starts
9            3.764    would stem cell research figure into the genetic connection
9            2.670    and it does appear that there is a relationship
9            2.987    it helps with some problems and not with others
9            2.947    these kids don't deserve to be educated they say
9            3.085    an egyptian court convicted a man of sexual harassment
10           2.779    and we also saw that a number of researchers left
10           4.070    i also understand the moral objections that some people have
10           3.151    but he was young and this had happened seemingly overnight
10           3.242    genetics load the gun and the environment pulls the trigger
10           3.180    let them rot in their dead end low class jobs
10           4.406    that's why those real time classroom scenes are so startling
11           3.093    then they all take a deep breath and enter the arena
11           4.085    and that it wasn't immune from last year's deteriorating economic environment
12           3.566    a finger that wiggles and you can't really get it to stop
12           4.469    is well in motion before any of these symptoms first become clear
12           3.968    the only reason i really did was because of my family history
12           3.363    and that was the prevailing wisdom for a very very long time
12           2.976    but they don't necessarily need or want to know more than that
12           4.668    but just when you're getting a warm utopian feeling something bad happens
13           3.370    but the more we learn about the disease the more complicated it becomes
13           4.752    in lots of ways it made the research task that much more complex
13           4.034    and now it's nominated for an academy award for best foreign language feature
13           5.252    for a brief spell they seem younger more open and ready to learn
13           4.551    i'll say the hero of the movie threatens to become its bad guy
14           4.785    and by then the lesson has been derailed and there are snickers all around
14           5.398    what finally rouses most of his students is an assignment to write self portraits
15           3.997    he writes on the blackboard and a student makes him stop and define a word
15           4.231    he tells his colleagues that it's the job of the teacher to bring kids out
16           4.843    pink slips seem to be raining down in just about every sector and every zip code
18           4.939    i noticed that i didn't think that my arm was swinging quite the same way when i jogged
18           4.252    i guess i choose to believe that i'll be able to do this for a very long time
18           6.211    the class is a semi improvised look inside a high school in a diverse working class paris neighborhood
18           5.047    the toy maker based here in los angeles said in november it would shed about a thousand jobs
19           7.294    they show you that at least until the system can be changed the battles will be moment to moment
20           6.127    but i think it's pretty clear that it had a negative effect on the way in which the field progressed
21           5.886    they went to other places that were more open to stem cell research whether that was in europe or in singapore
21           5.376    he had been in declining health for the past year dealing with the long term effects of the stroke he suffered
22           9.297    the first thing i remember noticing was this odd buzzing tingling sensation in my left leg and to some extent my left arm
23           7.848    republicans disagree complaining that the bill's tax cuts fall short and that it spends too much on things they say won't create jobs
23           6.302    sales of both product lines fell more than twenty percent last year as fewer people bought toys leading up to the holiday season
24           6.795    i used to feel that my cell phone was vibrating and i'd reach for it and then i'd find that there was nothing there
24           7.126    for a lot of people it's a dilemma because your faith might be telling you one thing and your body is telling you another
25           7.352    and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true

Table 15 Actual Spoken Phrases

4.3 Environment

The experiment was conducted in the Human Centered Computing Lab within the Computer Science and Software Engineering department at Auburn University. The lab was a semi-private room in that it was home to a few graduate students, but was not open to all of the students of the department. Occasionally, background noise would be present, such as a door closing, a phone ringing, the hum from the air conditioning and heating unit, or a microwave signaling the end of a cooking cycle. To verify that the background noise of the environment remained constant throughout the duration of the experiment, a decibel meter was used to measure the noise level of the lab. A decibel (dB) is a unit of measurement for sound magnitude, and a measurement uses one of three common weightings (Pierce 1981). The decibel meter used in the experiment was capable of measuring with either A-weighting (A-curve) or C-weighting (C-curve) frequency characteristics. C-weighting is commonly used for musical material, whereas A-weighting responds to frequencies in the 500-to-10,000 Hz range, the range to which human ears are most sensitive, and was used in the experiment. The decibel meter is capable of measuring sound ranging from 50 dB to 126 dB, where average conversation is measured at 60 dB (Durrant and Lovrinic 1995).
The average sound measurement of the lab was consistently below 50 dB, which implied that there was no significant background noise that would interfere with the speech recognition attempts.

4.4 Materials

A combination of hardware and software was used to implement the design and experimentation of Distributed Listening, including the listeners and the speech recognition software. Each of the components listed in this section and its subsections is necessary, and the components work together.

4.4.1 Listeners

Distributed Listening, in theory, is capable of processing recognitions from any number of listeners. For the scope of this experiment, three listeners were utilized as follows:

1. Listener 1: Toshiba Satellite A355-S6935 with an Intel® Core™ 2 Duo CPU T6400 and 2.99 GB RAM running Microsoft Windows XP Professional, Service Pack 3. This listener will also be referred to as Satellite.
2. Listener 2: Dell Latitude D830 with an Intel® Core™ 2 Duo CPU T7500 and 2.00 GB of RAM running Microsoft Windows XP Professional, Service Pack 3. Listener 2 will also be referred to as Dell.
3. Listener 3: Toshiba Satellite with a Genuine Intel® CPU T2600 and 3.24 GB RAM running Microsoft Windows XP Tablet Edition, Service Pack 3. This listener will also be referred to as Tablet.

Microsoft Windows XP has an option in the Speech Control Panel to train profiles to improve recognition accuracy. Each of the listeners has this option, but it was not used, in keeping with the over-arching goal of Distributed Listening: to improve speech recognition accuracy in sub-optimal, speaker-independent environments.

4.4.2 Dragon NaturallySpeaking 10

Dragon NaturallySpeaking is a commercially available speech recognition software solution created by Nuance Communications that includes a recognition engine.
It is a speaker-dependent recognition system that is optimized when profiles are created and trained for the way specific users speak and the way they use words in regard to a vocabulary and language model. Since a premise of Distributed Listening is to increase accuracy on untrained systems, the user profile that was created for the experiment was not trained or optimized. Dragon NaturallySpeaking 10 was loaded onto each of the listeners and was the interface between capturing the audio recordings and transcribing them. Dragon NaturallySpeaking 10 has a tool called DragonPad, which is a built-in word processing feature optimized for dictation. While Dragon NaturallySpeaking will work with other word processors, it was a natural selection to use the provided tool.

4.4.3 Corpus

The corpus that populated the database was taken from the open portion of the American National Corpus (OANC) (Ide and Suderman 2007). The OANC contains approximately 15 million words from both spoken and written sources. The words from the OANC are presented in text files. Those text files were parsed using a script, the bigrams were found, and a MySQL database table was populated with the bigrams and a frequency count for each bigram. Because the OANC included spoken and transcribed words, some bigrams were removed. Those included bigrams that had filler words, like "um" and "ah", repeated words that came from stuttering, and grammatically incorrect bigrams like "it's is". It is important to note that not all repeated words are assumed incorrect. For example, the sentence "it's been a very very long trip" contains the bigram "very-very", which can be assumed to have been used deliberately to put emphasis on the type of trip that was taken. In addition, punctuation marks were removed and hyphens were replaced with spaces within the OANC text files. The final corpus contained 456,981 unique bigrams with a total frequency count of 3,291,722.
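The corpus preparation described above can be illustrated with a minimal Python sketch. It is not the script used in this research. The function names are hypothetical, an in-memory Counter stands in for the MySQL table, and the handling of case and apostrophes is an assumption (text is lowercased and apostrophes are kept so that contractions such as "don't" survive). Only the two filler words named in the text are filtered. The removal of stuttered repetitions and grammatically incorrect bigrams is omitted, since distinguishing them from deliberate repetitions such as "very-very" requires judgment that the text does not reduce to a rule.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: only the fillers named in the text; the full list used is not given.
FILLER_WORDS = {"um", "ah"}


def tokenize(text: str) -> List[str]:
    """Lowercase, replace hyphens with spaces, and strip punctuation."""
    text = text.lower().replace("-", " ")
    # Keep letters, apostrophes, and spaces; everything else becomes a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()


def count_bigrams(lines: Iterable[str]) -> Counter:
    """Count adjacent word pairs, skipping any pair that contains a filler word."""
    counts: Counter = Counter()
    for line in lines:
        words = tokenize(line)
        for first, second in zip(words, words[1:]):
            if first in FILLER_WORDS or second in FILLER_WORDS:
                continue
            counts[(first, second)] += 1
    return counts


sample = ["It's been a very-very long trip, um, really."]
sample_counts = count_bigrams(sample)
```

In this sketch, each key of the resulting Counter corresponds to one row of the bigram table, with the count as its frequency.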
4.5 Procedure

Three separate tests were run, with each test having 56 trials. The tests differed in the way the audio was collected by the listeners; otherwise, the procedure was the same and adhered to the following steps:

1. Each listener received the 56 audio recordings in the same order.
2. The audio was processed using the Dragon NaturallySpeaking 10 software.
3. The transcribed recognitions of each listener were processed through the Distributed Listening algorithm.
4. For each of the 56 recordings, the algorithm produced the most likely spoken phrase, which was subsequently saved to a text file for further review.

The difference in this procedure occurred in step 1. There were three separate methods used for capturing audio by the listeners, as described in the next three subsections.

4.5.1 Stereo Mix Option

The first test captured the audio using the stereo mix option of the laptops. Stereo mix is an option that allows a computer to hear sound that is playing through the computer's sound card. The computer's microphone is not capturing the sound playing through the speakers; rather, the computer is internally capturing sound playing through the sound card. This was done by loading the 56 separate recordings into Windows Media Player and playing them sequentially on each listener. As the recordings played through Windows Media Player, Dragon NaturallySpeaking 10 captured the recognitions. The 56 recognitions were saved to a text file, to be used with the recognitions from the other two listeners.

4.5.2 External Speaker and External Microphone Option

The second test mimicked the way in which a typical person would speak to a computer, that is, using a typical desktop microphone plugged into the standard "mic in" jack of the laptop. When a person speaks into a microphone, placement of the microphone, as well as the volume of the speech, is important.
Therefore, each laptop microphone was configured for optimal volume level, and that level of speech was recorded using the decibel meter. Afterward, the 56 recordings were loaded onto a SONY ICD-P17 digital recorder and an external speaker was plugged into the earphone jack of the recorder. The volume of the speaker was then set to the optimized level that was found using the decibel meter (approximately 70 dB). The external speaker was then placed approximately two inches away from the external microphone (as is the case when a person speaks into a desktop microphone), Dragon NaturallySpeaking was started, and the recognitions were captured and saved to a text file (figure 33).

Figure 33 External Speaker/Microphone Experiment Setup

4.5.3 3.5mm Cable Option

The last test involved a 3.5mm male-to-male audio cable. One end of the cable was plugged into the earphone jack of the digital recorder and the other end was plugged into the microphone jack of the laptop, using the line-in option. Line-in allows a computer to capture audio from devices that are connected to the computer. Once the cable was connected and the settings selected, Dragon NaturallySpeaking 10 was loaded and the 56 files were played through the digital recorder (figure 34). Dragon NaturallySpeaking captured the recognitions, which were ultimately saved to a text file and used with the recognitions from the other two listeners.

Figure 34 3.5mm Cable Experiment Setup

The three separate experiments each produced very different results and will be discussed in detail in the next chapter.

5 Research Findings

The first metric used to determine the success of Distributed Listening was the accuracy percentage of how often the result from Distributed Listening matched the actual spoken phrase, compared to the same accuracy percentage of the individual listeners. The actual formula used was:

\[ \text{Accuracy} = \frac{\text{Instances Where Result Matched Actual Spoken Phrase}}{\text{Total Number of Actual Spoken Phrases}} \]

The three tests described previously in section 4.5 were further broken down into categories. After the completion of the 56 trials of each test, the comparative analysis had 6 designations:

1. Overall: This category compared the overall accuracy of the individual listeners with the result returned by Distributed Listening across all 56 trials.
2. Overall Interpretation: For all 56 trials, this category looked at the individual recognition results of the listeners, as well as the result of the Distributed Listening system, and counted those results that had a semantic interpretation that matched the actual spoken phrase as correct. If the semantic meaning of a recognition or Distributed Listening result resolved to the meaning of the actual spoken phrase, then that result had a correct semantic interpretation and was considered equal to the actual spoken phrase for calculating accuracy.
3. Valid: This category put a definition on the actual spoken phrases. If all of the bigrams of the actual spoken phrase were in the corpus, then the phrase was deemed valid. The results of the individual listeners and of Distributed Listening that corresponded to a valid actual spoken phrase were used to compute the accuracy in this category.
4. Valid Interpretation: This category looked at the subset of phrases that were deemed valid; if a semantic interpretation of the result of an individual listener or Distributed Listening matched the actual spoken phrase, it was also counted as correct when computing accuracy.
5. Invalid: This category was made up of those actual spoken phrases, and corresponding results, that had at least one bigram that did not appear in the corpus.
6. Invalid Interpretation: From the subset of invalid actual spoken phrases, if a recognition or result had a semantic interpretation that resolved to the actual spoken phrase, it was also counted as correct.
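The valid and invalid designations and the accuracy metric can be expressed compactly in Python. The sketch below is illustrative only: the helper names are hypothetical, the corpus is represented as an in-memory set of word pairs rather than a database table, and a result counts as correct only on an exact string match. The semantic interpretation categories relied on human judgment and are not modeled here.

```python
from typing import List, Set, Tuple

Bigram = Tuple[str, str]


def phrase_bigrams(phrase: str) -> List[Bigram]:
    """Return the adjacent word pairs of a phrase."""
    words = phrase.split()
    return list(zip(words, words[1:]))


def is_valid_phrase(phrase: str, corpus_bigrams: Set[Bigram]) -> bool:
    """A phrase is valid when every one of its bigrams is in the corpus."""
    return all(b in corpus_bigrams for b in phrase_bigrams(phrase))


def accuracy(results: List[str], actual: List[str]) -> float:
    """Fraction of results that exactly match the actual spoken phrase."""
    if not actual or len(results) != len(actual):
        raise ValueError("results and actual must be non-empty and equal in length")
    matches = sum(1 for r, a in zip(results, actual) if r == a)
    return matches / len(actual)


# Toy corpus for illustration only; the real corpus holds 456,981 unique bigrams.
toy_corpus = {("your", "handwriting"), ("handwriting", "has"), ("has", "changed")}
valid = is_valid_phrase("your handwriting has changed", toy_corpus)
invalid = is_valid_phrase("your handwriting had changed", toy_corpus)
half = accuracy(["a b", "c d"], ["a b", "c e"])
```

Restricting the inputs of accuracy to the trials whose actual spoken phrase is valid (or invalid) gives the corresponding category figures.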
The second metric used to test the success of Distributed Listening was the word error rate (WER), as calculated by the following formula:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]

where:
S = The number of substitutions
D = The number of deletions
I = The number of insertions
N = The number of words in the correct word sequence

The WER is based on the Levenshtein distance, also called the edit distance, and is a common metric used for establishing the accuracy of speech recognition (McCowan et al. 2005). The number returned by the formula is an indication of how similar two word sequences are, based on the minimum number of word insertions, substitutions, and deletions it takes to turn one sequence into the other. In recent years, there have been research activities to establish a more accurate metric. Some scholars argue that the current WER is hard to interpret, since it is quite possible to have a WER greater than 1, in which case the corresponding word accuracy (1 - WER) is negative. Because a recognition result can be significantly longer than the actual word sequence, insertions alone can push the WER above unity. Additionally, the WER measure does not address word importance. Despite the validity of the arguments against the WER, a new metric has not been established and therefore, the WER is the metric that will be used for the data analysis of this research task. The reference word sequence for each WER calculation presented in this chapter will be the actual spoken phrase that corresponds to each listener's recognition results and the results of Distributed Listening. No additional evaluation measures will be used. Demographic characteristics could not be collected given the nature of the data collection of the spoken phrases.
A best guess could have been attempted to determine the gender of the speaker, but without explicitly being told, there was no way to definitively determine age, education level, experience with spoken language systems, reading level, nationality, or gender. Each of the three tests underwent the same analysis, as discussed in the next three subsections.

5.1 Stereo Mix Option

During this test, the listeners had the best individual results and Distributed Listening returned optimal results, as shown in table 16. Within the overall category, across all 56 utterances, the best individual listener had a recognition accuracy of 48%, which was also the accuracy of Distributed Listening.

Category                Listener 1 (Satellite)  Listener 2 (Dell)  Listener 3 (Tablet)  Distributed Listening
Overall                 36%                     48%                39%                  48%
Overall Interpretation  38%                     50%                43%                  52%
Valid                   32%                     50%                39%                  54%
Valid Interpretation    32%                     50%                43%                  57%
Invalid                 39%                     46%                39%                  43%
Invalid Interpretation  43%                     50%                43%                  46%

Table 16 Stereo Mix Option Recognition Accuracy Rates

The overall interpretation category showed an improvement in recognition accuracy over the overall category for all three listeners, as well as Distributed Listening. Listener 1 and listener 2 both improved by 2%, which is the equivalent of one additional result being counted as correct. Listener 3 and Distributed Listening improved by 4%, or two additional results being counted as correct. The best individual listener within the overall interpretation category had an accuracy of 50%, which Distributed Listening exceeded at 52%. The correct semantic interpretations of the listeners and Distributed Listening are displayed in table 17, with the corresponding semantic equivalents shown as underlined text.
Trial with the actual spoken phrase "it's a disease that affects how you move":
Listener 1 (Satellite):  it's a disease that affects how you move
Listener 2 (Dell):       is a disease that affects how you move
Listener 3 (Tablet):     it is a disease that affects how you move
Distributed Listening:   it is a disease that affects how you move
Actual Spoken Phrase:    it's a disease that affects how you move

Trial with the actual spoken phrase "but he was young and this had happened seemingly overnight":
Listener 1 (Satellite):  but he was young and this happened seemingly overnight
Listener 2 (Dell):       but he was young and this happened seemingly overnight
Listener 3 (Tablet):     but he was young and this happened seemingly overnight
Distributed Listening:   but he was young and this happened seemingly overnight
Actual Spoken Phrase:    but he was young and this had happened seemingly overnight

Table 17 Correct Semantic Interpretations of the Actual Spoken Phrase

For the valid phrases category, Distributed Listening had an accuracy rate of 54%, which was better than the best individual listener. The valid interpretation category showed that only listener 3 and Distributed Listening improved over the valid phrases category. Listener 3 improved from 39% to 43%, yet still did not have the best individual accuracy. The best individual listener achieved an accuracy of 50%, compared to Distributed Listening at 57%.

The last two categories, invalid and invalid interpretation, showed that Distributed Listening did not beat the best individual listener; this result stems from a special case. In one of the trials, listener 1 and listener 3 agreed word-for-word on their recognition results, yet those resulting recognitions were not equal to the actual spoken phrase. In contrast, listener 2 recognized the actual spoken phrase at 100% correctness (table 18). This caused Distributed Listening to return, as the most likely spoken phrase, the recognition from the listeners that agreed. Because of this special case, the accuracy result of Distributed Listening in both categories is less than optimal.
If that particular trial is omitted from the calculations, then the Distributed Listening accuracy rate would be equal to that of the best individual listener in both the invalid and invalid interpretation categories. The recognition results of the three listeners, the result from Distributed Listening, and the actual spoken phrase for the aforementioned trial are listed in table 18, with the differences of the phrases indicated with underlined text.

Listener 1 (Satellite):   they show you that at least until the system can be changed but battles will be moment to moment
Listener 2 (Dell):        they show you that at least until the system can be changed the battles will be moment to moment
Listener 3 (Tablet):      they show you that at least until the system can be changed but battles will be moment to moment
Distributed Listening:    they show you that at least until the system can be changed but battles will be moment to moment
Actual Spoken Phrase:     they show you that at least until the system can be changed the battles will be moment to moment

Table 18 Agreement Between Listeners on Incorrect Recognitions

For a complete listing of the recognition results from the listeners, the Distributed Listening results, and the actual spoken phrases from all 56 trials, refer to appendix A. Appendix A also indicates which of the actual spoken phrases are considered valid and which are considered invalid.

Distributed Listening returned optimal results with regard to the word error rate (WER). Listener 1 had an average WER of 0.36. Listeners 2 and 3 had average word error rates of 0.31 and 0.34, respectively. Distributed Listening had the lowest average word error rate of 0.26. The word error rates of each listener and Distributed Listening for all 56 trials are displayed in table 19. The phrase numbers in the table correspond to the phrase numbers as listed in appendix A.
Phrase   Listener 1    Listener 2   Listener 3   Distributed
Number   (Satellite)   (Dell)       (Tablet)     Listening
1        0.00          0.25         0.25         0.25
2        0.58          0.00         0.42         0.00
3        0.38          0.00         0.00         0.00
4        0.00          0.00         0.00         0.00
5        0.00          0.00         0.00         0.00
6        0.33          0.00         0.00         0.00
7        0.57          0.26         0.09         0.09
8        0.83          0.67         0.63         0.63
9        0.33          0.33         0.33         0.33
10       0.00          0.00         0.00         0.00
11       0.00          0.00         0.00         0.00
12       0.89          0.22         0.22         0.22
13       0.44          0.44         0.44         0.44
14       0.31          0.54         0.62         0.54
15       0.00          0.00         0.00         0.00
16       0.40          0.70         0.40         0.40
17       0.00          0.33         0.00         0.00
18       0.10          0.20         0.20         0.20
19       0.00          0.00         0.00         0.00
20       0.08          0.08         0.08         0.08
21       0.40          0.40         0.40         0.40
22       0.00          0.00         0.00         0.00
23       0.00          0.00         0.00         0.00
24       0.00          0.00         0.00         0.00
25       0.56          0.56         0.56         0.56
26       0.00          0.00         0.00         0.00
27       0.83          0.75         1.50         0.75
28       0.00          0.00         0.00         0.00
29       0.00          0.00         0.00         0.00
30       0.06          0.06         0.06         0.06
31       0.55          0.55         0.55         0.55
32       0.33          0.33         0.33         0.33
33       0.44          0.00         0.00         0.00
34       0.79          0.14         0.14         0.14
35       0.78          0.00         0.56         0.00
36       0.40          0.00         0.00         0.00
37       0.00          0.00         0.00         0.00
38       0.00          0.00         0.20         0.00
39       0.79          1.00         0.79         0.79
40       0.38          0.00         0.00         0.00
41       0.00          0.00         0.00         0.00
42       0.15          0.38         0.15         0.15
43       2.10          1.20         1.70         1.70
44       0.16          0.00         0.16         0.16
45       0.35          0.35         0.22         0.35
46       1.13          0.83         1.09         1.09
47       1.67          1.17         1.67         1.67
48       0.18          0.73         1.00         0.18
49       0.50          0.88         0.38         0.50
50       0.94          1.00         0.94         0.94
51       0.00          0.00         0.88         0.00
52       0.00          0.00         0.00         0.00
53       0.75          2.25         0.75         0.75
54       0.13          0.63         0.88         0.13
55       0.00          0.00         0.00         0.00
56       0.57          0.00         0.57         0.00
Average  0.36          0.31         0.34         0.26

Table 19 Stereo Mix Word Error Rates

5.2 External Speaker and External Microphone Option

During this test, the individual listeners had the poorest recognition results. For a full listing of recognition results from this experiment, including the variability of the 3 listeners, see appendix B. Table 20 displays the actual accuracy results of the 56 trials, across the 6 categories. Notice that listener 1 (Satellite) produced 0% accuracy across all 6 categories.
Individually, the other two listeners did not achieve accuracy rates that were much better than those of listener 1. Yet, for each of the 6 categories of this test, Distributed Listening matched the recognition accuracy of the best individual listener.

                         Listener 1    Listener 2   Listener 3   Distributed
                         (Satellite)   (Dell)       (Tablet)     Listening
Overall                  0%            7%           2%           7%
Overall Interpretation   0%            7%           2%           7%
Valid                    0%            4%           0%           4%
Valid Interpretation     0%            4%           0%           4%
Invalid                  0%            11%          4%           11%
Invalid Interpretation   0%            11%          4%           11%

Table 20 External Speaker/Microphone Option Recognition Accuracy Rates

This test is the only test that did not show an improvement in recognition accuracy for any of the listeners or Distributed Listening when semantic interpretation of the actual spoken phrases was taken into account. The poor recognition accuracy of the listeners and the resulting percentages show that when nonsensical recognitions are combined with syntactically and grammatically correct recognitions and a reliable corpus, optimal results can still be produced.

It is also worth mentioning that one of the 56 trials did not produce a conclusive result (table 21). This is because the number of permutations of a recognition result, and therefore the number of resulting potential candidate phrases, from just one listener was greater than 45,349,600. Even in a real-time parallel processing environment, an exhaustive listing of candidate phrases is prohibitive with regard to time. This is attributed to the poor accuracy of the individual listeners. When the individual listeners differ at every word, by position, and there are several words per recognition result, it is not feasible to produce an accurate result, and Distributed Listening is not a practical solution in such environments.
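The growth in candidate phrases can be illustrated with a short sketch. It is not the candidate generation used in this research and it does not reproduce the figure quoted above. It assumes, purely for illustration, that a candidate is formed by picking one listener's word at each word position, so that the number of candidates is the product of the number of distinct words per position. The function name count_candidates and the sample recognitions are hypothetical.

```python
from math import prod


def count_candidates(recognitions):
    """Count phrases formed by picking one listener's word per position.

    Assumption for illustration only: recognitions are aligned by word
    position and truncated to the shortest result. The actual candidate
    generation of Distributed Listening may differ.
    """
    tokenized = [r.split() for r in recognitions]
    length = min(len(words) for words in tokenized)
    # Number of distinct words the listeners produced at each position.
    choices = [len({words[i] for words in tokenized}) for i in range(length)]
    return prod(choices)


# Hypothetical recognitions: three listeners that disagree at every position.
for n_words in (5, 10, 17):
    listeners = [" ".join(f"w{k}_{i}" for i in range(n_words)) for k in range(3)]
    print(n_words, count_candidates(listeners))
```

Under this simplified model, three listeners that disagree at each of 17 positions already yield 3^17 = 129,140,163 candidates, whereas listeners that agree everywhere yield exactly one.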
The accuracy results shown in table 20 are calculated out of 55 trials, since one round of trials was inconclusive.

Listeners 1 to 3 (Satellite, Dell, Tablet), in column order:   though the feature set aside as he was the one that has been known as for the picture is satisfied when they first what would be wrong with that and then know if as for the future have to set aside when they are well with you will not admit to it
Distributed Listening:    inconclusive
Actual Spoken Phrase:     and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true

Table 21 External Speaker/Microphone Inconclusive Round

The average word error rates of the three listeners and Distributed Listening are expected to be greater than one, because of the inconsistencies of the individual recognizers. Given that the recognitions of the individual listeners were so inaccurate and did not readily resemble the corresponding actual spoken phrases, it is expected that the WER will be quite high. Likewise, the resulting candidate phrases that were returned from Distributed Listening did not readily resemble the corresponding actual spoken phrases and had extremely high word error rates. Listener 1, which had 0% accuracy, had the highest average WER at 3.47. This WER was considerably larger than the rates of listeners 2 and 3, and over twice as great as that of Distributed Listening. Listener 2 had an average WER of 1.64, listener 3 had an average WER of 1.97, and Distributed Listening returned the lowest average WER of 1.60. Because of the inconclusive round, the average word error rates are calculated out of 55 trials and the inconclusive round is not listed in table 22.
Phrase   Listener 1    Listener 2   Listener 3   Distributed
Number   (Satellite)   (Dell)       (Tablet)     Listening
1        2.00          1.25         0.63         0.63
2        3.58          2.08         2.33         2.08
3        3.25          2.75         3.38         2.75
4        2.00          0.00         0.00         0.00
5        3.67          1.75         3.33         1.75
6        3.44          0.33         1.56         0.33
7        4.00          1.70         1.91         1.91
8        3.50          1.83         0.83         0.83
9        3.56          0.67         1.56         0.78
10       3.92          0.00         1.17         0.00
11       4.08          1.00         0.83         0.83
12       3.78          1.44         2.00         1.44
13       5.67          2.78         3.00         3.00
14       4.00          0.92         2.00         0.92
15       3.54          0.46         1.00         0.46
16       3.90          1.80         0.80         0.80
17       4.62          2.38         2.33         2.33
18       1.40          1.25         1.05         1.05
19       4.30          0.00         0.50         0.00
20       4.00          1.96         2.50         1.96
21       3.00          4.30         3.40         3.00
22       1.00          0.44         0.44         0.44
23       2.50          2.20         1.50         1.50
24       5.00          0.00         1.88         0.00
25       2.44          1.22         0.78         0.78
26       3.00          0.39         0.89         0.39
27       4.25          3.00         1.75         1.75
28       2.57          2.14         1.24         2.10
29       1.00          0.62         2.00         0.62
30       2.83          1.50         2.39         3.06
31       3.45          1.27         1.18         1.18
32       2.80          2.33         2.67         2.33
34       3.14          2.64         2.93         2.64
35       4.78          0.33         1.78         1.22
36       2.40          0.70         2.30         0.70
37       4.14          1.71         0.71         1.71
38       3.80          1.60         2.00         1.87
39       5.36          2.07         4.21         2.07
40       3.92          2.23         3.54         2.23
41       4.42          2.92         4.00         3.42
42       3.54          1.38         2.08         2.08
43       4.80          3.50         3.10         3.50
44       3.26          1.05         2.42         1.05
45       2.48          1.91         2.17         2.17
46       4.22          3.17         3.74         3.09
47       3.83          2.72         2.94         2.94
48       5.09          4.00         5.18         5.18
49       3.50          1.63         2.25         1.63
50       2.13          3.13         2.50         2.56
51       3.00          1.00         1.00         1.00
52       1.50          0.75         1.38         0.75
53       6.75          2.25         2.25         2.25
54       4.50          0.63         0.88         0.63
55       1.89          2.33         1.56         1.56
56       2.14          0.57         0.57         0.57
Average  3.47          1.64         1.97         1.60

Table 22 External Speaker/Microphone Word Error Rates

5.3 3.5mm Cable Option

The accuracy rates of this experiment are only slightly better than those of the external speaker/microphone experiment and are listed in table 23. For the overall category, the best individual listener achieved an accuracy of 7%, which Distributed Listening exceeded at 9%.
                         Listener 1    Listener 2   Listener 3   Distributed
                         (Satellite)   (Dell)       (Tablet)     Listening
Overall                  5%            7%           7%           9%
Overall Interpretation   7%            7%           7%           9%
Valid                    7%            7%           4%           7%
Valid Interpretation     11%           7%           4%           7%
Invalid                  4%            7%           11%          11%
Invalid Interpretation   4%            7%           11%          11%

Table 23 3.5mm Cable Option Accuracy Rates

The overall interpretation category, when compared to the overall category, showed an increase for just one of the listeners. Interestingly, listener 1 increased accuracy from 5% to 7%, yet the result from Distributed Listening did not reflect that increase. Although listener 1 had a semantic result that resolved to the same meaning as the actual spoken phrase, listeners 2 and 3 agreed on aspects of their recognition results, which caused Distributed Listening to return a result that included the agreed upon bigrams, as shown in table 24. The semantic equivalents are indicated with underlined text in the table. Notwithstanding the round listed in table 24, Distributed Listening still had the best accuracy of this category at 9%, with the best individual listener achieving 7%.

Listener 1 (Satellite):   authorities have been slow to address it
Listener 2 (Dell):        the authority has been slow to attract
Listener 3 (Tablet):      the authority to go to
Distributed Listening:    the authority has been slow to it
Actual Spoken Phrase:     the authorities have been slow to address it

Table 24 3.5mm Cable Correct Semantic Interpretation of the Actual Spoken Phrase

The valid category had a best individual listener at 7% accuracy, which Distributed Listening met. The valid interpretation category showed an improvement over the valid category for listener 1, from 7% to 11%, yet Distributed Listening again did not reflect that increase. In this category, the 11% accuracy of listener 1 is better than Distributed Listening at 7%.
As shown previously in table 24, the semantic interpretation of listener 1 for one of the trials resolved to the correct meaning of the actual spoken phrase, but the agreement between listeners 2 and 3 on misrecognized bigrams caused Distributed Listening to return a result that included the agreed upon bigrams.

The last two categories of table 23, invalid and invalid interpretation, show that Distributed Listening was as good as the best individual listener. For both categories, the best individual listener achieved an accuracy of 11%, which was the same as the accuracy of Distributed Listening. The complete list of recognition results for this experiment is displayed in appendix C.

Distributed Listening had the lowest average word error rate for this dataset, at 1.55. Listeners 1, 2, and 3 had average word error rates of 2.53, 1.70, and 1.57, respectively. Notice that the average WER of Distributed Listening is extremely close to that of listener 3. Because of the variability of the recognition results of the individual listeners, this is expected. The WER is calculated based on the number of character insertions, deletions, and substitutions needed to change one string into a reference string. Given the variability of the recognition results, as Distributed Listening created candidate strings from recognitions that were overwhelmingly different from the actual spoken phrase, those candidate strings required a large number of insertions, deletions, and substitutions. The complete list of word error rates is given in table 25.
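The calculation described above can be sketched as follows. The text does not state the normalizing term, so the sketch assumes that the character-level edit count is divided by the number of words in the actual spoken phrase. That assumption is consistent with the 0.25 reported in table 19 for listeners 2 and 3 on the first phrase of table 17, but it remains an assumption. Note that this differs from the conventional word error rate, which counts word-level edits, and that counting characters while dividing by words is one way a rate can exceed one. The function names are illustrative.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: character insertions, deletions, substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1,      # delete s_char
                               current[j - 1] + 1,   # insert t_char
                               substitution))
        previous = current
    return previous[-1]


def error_rate(recognized: str, actual: str) -> float:
    """Character edits per word of the actual phrase (assumed normalization)."""
    return edit_distance(recognized, actual) / len(actual.split())


actual = "it's a disease that affects how you move"
print(error_rate("is a disease that affects how you move", actual))     # 0.25
print(error_rate("it is a disease that affects how you move", actual))  # 0.25
```

Both recognitions are two character edits away from the eight-word actual phrase, which gives 2 / 8 = 0.25.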
Phrase   Listener 1    Listener 2   Listener 3   Distributed
Number   (Satellite)   (Dell)       (Tablet)     Listening
1        0.00          0.00         0.00         0.00
2        2.83          0.83         0.75         0.75
3        2.38          1.00         1.38         1.38
4        0.00          0.00         0.00         0.00
5        4.25          2.58         0.50         0.50
6        1.22          2.89         0.33         0.33
7        2.13          1.83         1.00         1.00
8        1.96          1.79         1.42         1.29
9        1.39          1.22         1.11         1.22
10       1.08          2.92         0.50         0.50
11       1.83          1.75         0.00         0.00
12       1.67          1.22         1.89         1.22
13       3.22          2.67         3.44         3.44
14       2.38          3.00         2.62         2.62
15       1.08          2.38         0.31         0.31
16       2.10          1.20         0.80         0.80
17       3.52          0.71         0.86         0.71
18       2.00          1.00         0.50         0.50
19       1.60          0.40         0.00         0.40
20       2.54          0.75         1.63         0.75
21       2.00          1.00         2.00         1.00
22       0.00          0.00         0.44         0.00
23       3.90          0.30         1.00         0.30
24       1.38          2.13         3.88         5.13
25       1.00          1.89         0.67         5.78
26       3.06          1.06         0.28         3.06
27       3.33          0.83         1.58         0.83
28       2.38          1.05         1.19         2.10
29       2.23          0.00         0.69         0.00
30       2.00          2.39         1.17         1.17
31       2.09          0.82         1.09         1.09
32       2.87          1.60         2.67         1.60
33       2.88          1.40         2.04         1.40
34       4.00          2.71         1.93         1.93
35       2.78          1.11         1.22         1.22
36       2.10          2.50         2.10         2.50
37       4.29          2.00         3.00         3.00
38       3.07          1.47         2.53         1.47
39       4.07          2.29         2.00         2.00
40       3.38          2.92         3.54         3.38
41       4.67          3.00         2.25         2.25
42       3.15          1.08         1.69         1.08
43       4.30          4.00         3.20         3.20
44       3.47          1.16         1.11         1.16
45       3.13          1.43         2.17         1.26
46       4.22          3.83         1.96         1.96
47       3.89          2.50         2.94         2.50
48       5.00          4.00         3.27         3.27
49       3.88          2.75         3.38         2.75
50       2.81          3.06         2.44         2.44
51       3.75          1.25         1.25         1.25
52       2.25          1.88         0.75         0.75
53       2.25          2.25         2.50         2.50
54       0.50          1.50         3.25         1.50
55       1.89          1.56         1.33         1.44
56       0.57          0.57         0.57         0.57
Average  2.53          1.70         1.57         1.55

Table 25 3.5mm Cable Word Error Rates

5.4 n-Listeners Combinations

The previous research findings were calculated using 3 listeners, yet Distributed Listening, in theory, is capable of using any number of listeners. Following that principle, it is worthwhile to analyze combinations of the Distributed Listening results in additional multi-listener environments, as presented in the following three subsections.
The baseline comparison for the data analysis will be the accuracy rates of the stereo mix experiment, as the accuracy rates from that experiment were the best.

5.4.1 Stereo/Speaker Combination

The most likely spoken phrases as returned by Distributed Listening from the 3-listeners experiments using the stereo mix option and the external speaker/microphone option were theoretically combined to simulate a 6-listener experiment. The results, per round of the 56 trials, were processed by calculating the cumulative local bigram frequency count of each phrase and selecting the phrase with the greatest total as the most likely spoken phrase. The individual bigram frequencies for this experiment were the same counts created from the 3-listeners experiments, and the same categories used in the 3-listeners experiments were used in the resulting data analysis.

The accuracy results of this experiment are shown in table 26. In the table, the column Stereo DL represents the most likely spoken phrases that were returned by Distributed Listening during the stereo mix experiment. The column Speaker DL represents the results of Distributed Listening from the external speaker/microphone experiment. The last column of the table corresponds to the results returned from this experiment.

                         Stereo DL   Speaker DL   Distributed Listening
Overall                  48%         7%           48%
Overall Interpretation   52%         7%           52%
Valid                    54%         4%           54%
Valid Interpretation     57%         4%           57%
Invalid                  43%         11%          43%
Invalid Interpretation   46%         11%          46%

Table 26 Stereo/Speaker Combination Recognition Accuracy Rates

Recall that the external speaker/microphone experiment had one round of trials that was inconclusive. In that case, the result from the stereo mix experiment was chosen by default. The Distributed Listening results of this experiment remained consistent with the stereo mix experiment. For each category of the experiment, the accuracy did not change from the stereo mix experiment.
At first glance, there appears to be no merit to combining additional ASR systems with regard to accuracy. However, this experiment showed that combining ASR systems that have relatively good recognition results with systems that have extremely poor recognition results does not degrade the resulting accuracy. The external speaker/microphone experiment resulted in the poorest accuracy results, with one listener reporting an overall accuracy of 0%. In all likelihood, the accuracy could improve if all of the combined ASR systems have relatively good individual recognition results.

The conclusion of the WER calculations is similar to that of the accuracy results. As shown in table 27, the average WER remained the same when compared to the stereo mix experiment. This is promising, as the poor results from the external speaker/microphone experiment did not increase the WER.

Phrase Number   Stereo DL   Speaker DL     Distributed Listening
1               0.25        0.63           0.25
2               0.00        2.08           0.00
3               0.00        2.75           0.00
4               0.00        0.00           0.00
5               0.00        1.75           0.00
6               0.00        0.33           0.00
7               0.09        1.91           0.09
8               0.63        0.83           0.83
9               0.33        0.78           0.33
10              0.00        0.00           0.00
11              0.00        0.83           0.00
12              0.22        1.44           0.22
13              0.44        3.00           0.44
14              0.54        0.92           0.54
15              0.00        0.46           0.00
16              0.40        0.80           0.40
17              0.00        2.33           0.00
18              0.20        1.05           0.20
19              0.00        0.00           0.00
20              0.08        1.96           0.08
21              0.40        3.00           0.40
22              0.00        0.44           0.00
23              0.00        1.50           0.00
24              0.00        0.00           0.00
25              0.56        0.78           0.56
26              0.00        0.39           0.00
27              0.75        1.75           0.75
28              0.00        2.10           0.00
29              0.00        0.62           0.00
30              0.06        3.06           0.06
31              0.55        1.18           0.55
32              0.33        2.33           0.33
33              0.00        inconclusive   0.00
34              0.14        2.64           0.14
35              0.00        1.22           0.00
36              0.00        0.70           0.00
37              0.00        1.71           0.00
38              0.00        1.87           0.00
39              0.79        2.07           0.79
40              0.00        2.23           0.00
41              0.00        3.42           0.00
42              0.15        2.08           0.15
43              1.70        3.50           1.70
44              0.16        1.05           0.16
45              0.35        2.17           0.35
46              1.09        3.09           1.09
47              1.67        2.94           1.67
48              0.18        5.18           0.18
49              0.50        1.63           0.50
50              0.94        2.56           0.94
51              0.00        1.00           0.00
52              0.00        0.75           0.00
53              0.75        2.25           0.75
54              0.13        0.63           0.13
55              0.00        1.56           0.00
56              0.00        0.57           0.00
Average         0.26        1.60           0.26

Table 27 Stereo/Speaker Combination Word Error Rates

5.4.2 Stereo/Cable Combination

Like the stereo/speaker combination, the stereo/cable theoretical experiment combined the Distributed Listening results from the stereo mix and 3.5mm cable experiments to select the most likely spoken phrase in a 6-listener environment. The selection was based on the sum of the local bigram frequency counts, with the candidate phrase with the greatest sum returned as the most likely spoken phrase. The local bigram frequency counts are the same as those that were calculated during the respective 3-listener experiments. Refer to table 28 for a comprehensive view of the accuracy results. Columns Stereo DL and Cable DL of the table refer to the respective results of the stereo mix and 3.5mm cable experiments. The Distributed Listening column refers to the results of this experiment.

                         Stereo DL   Cable DL   Distributed Listening
Overall                  48%         9%         50%
Overall Interpretation   52%         9%         52%
Valid                    54%         7%         57%
Valid Interpretation     57%         7%         57%
Invalid                  43%         11%        43%
Invalid Interpretation   46%         11%        46%

Table 28 Stereo/Cable Combination Accuracy Rates

The results for both the overall and valid categories increased compared to the results from the stereo mix experiment. In both cases, this is due to the first round of recognitions (table 29). As more listeners correctly "heard" the actual spoken phrase, the Evidence-Based Phrase Resolution Algorithm correctly reconciled the recognitions.
L1:                    it's a disease that affects how you move
L2:                    is a disease that affects how you move
L3:                    it is a disease that affects how you move
L4:                    it's a disease that affects how you move
L5:                    it's a disease that affects how you move
L6:                    it's a disease that affects how you move
DL:                    it's a disease that affects how you move
Actual Spoken Phrase:  it's a disease that affects how you move

Table 29 Stereo/Cable Combination Round 1

The remaining categories showed that Distributed Listening remained consistent when compared to the stereo mix experiment.

The results of the first round of trials are also reflected in the average WER (table 30). The average WER improved compared to the stereo mix experiment. The stereo mix experiment produced an average WER of 0.26, whereas Distributed Listening produced an average WER of 0.25.

Phrase Number   Stereo DL   Cable DL   Distributed Listening
1               0.25        0.00       0.00
2               0.00        0.75       0.00
3               0.00        1.38       0.00
4               0.00        0.00       0.00
5               0.00        0.50       0.00
6               0.00        0.33       0.00
7               0.09        1.00       0.09
8               0.63        1.29       0.63
9               0.33        1.22       0.33
10              0.00        0.50       0.00
11              0.00        0.00       0.00
12              0.22        1.22       0.22
13              0.44        3.44       0.44
14              0.54        2.62       0.54
15              0.00        0.31       0.00
16              0.40        0.80       0.40
17              0.00        0.71       0.00
18              0.20        0.50       0.20
19              0.00        0.40       0.00
20              0.08        0.75       0.08
21              0.40        1.00       0.40
22              0.00        0.00       0.00
23              0.00        0.30       0.00
24              0.00        5.13       0.00
25              0.56        5.78       0.56
26              0.00        3.06       0.00
27              0.75        0.83       0.83
28              0.00        2.10       0.00
29              0.00        0.00       0.00
30              0.06        1.17       0.06
31              0.55        1.09       0.55
32              0.33        1.60       0.33
33              0.00        1.40       0.00
34              0.14        1.93       0.14
35              0.00        1.22       0.00
36              0.00        2.50       0.00
37              0.00        3.00       0.00
38              0.00        1.47       0.00
39              0.79        2.00       0.79
40              0.00        3.38       0.00
41              0.00        2.25       0.00
42              0.15        1.08       0.15
43              1.70        3.20       1.70
44              0.16        1.16       0.16
45              0.35        1.26       0.35
46              1.09        1.96       1.09
47              1.67        2.50       1.67
48              0.18        3.27       0.18
49              0.50        2.75       0.50
50              0.94        2.44       0.94
51              0.00        1.25       0.00
52              0.00        0.75       0.00
53              0.75        2.50       0.75
54              0.13        1.50       0.13
55              0.00        1.44       0.00
56              0.00        0.57       0.00
Average         0.26        1.55       0.25

Table 30 Stereo/Cable Combination Word Error Rates
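The round 1 outcome in table 29 can be reproduced with a minimal sketch of the selection rule used in the combination experiments. The sketch assumes whitespace tokenization and assumes that the local bigram frequency counts are pooled over all six listeners. It omits the corpus validation and tie-breaking steps of the Evidence-Based Phrase Resolution Algorithm, and the function names are illustrative rather than taken from the implementation.

```python
from collections import Counter


def bigrams(phrase):
    """Return the adjacent word pairs of a phrase."""
    words = phrase.split()
    return list(zip(words, words[1:]))


def select_phrase(candidates, recognitions):
    """Pick the candidate whose bigrams have the greatest summed local count."""
    local_counts = Counter(b for r in recognitions for b in bigrams(r))
    scores = {c: sum(local_counts[b] for b in bigrams(c)) for c in candidates}
    return max(candidates, key=scores.get), scores


# Recognitions of the six listeners for round 1 (table 29).
recognitions = [
    "it's a disease that affects how you move",   # L1
    "is a disease that affects how you move",     # L2
    "it is a disease that affects how you move",  # L3
    "it's a disease that affects how you move",   # L4
    "it's a disease that affects how you move",   # L5
    "it's a disease that affects how you move",   # L6
]
# Most likely phrases returned by the stereo mix and 3.5mm cable experiments.
candidates = [
    "it is a disease that affects how you move",  # Stereo DL
    "it's a disease that affects how you move",   # Cable DL
]

best, scores = select_phrase(candidates, recognitions)
print(best)    # it's a disease that affects how you move
print(scores)
```

Under these assumptions the cable candidate scores 40 against 39 for the stereo candidate, because four of the six listeners support the bigram "it's a" while only three support "it is" or "is a" combined.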
Due to the poor recognition results of both the external speaker/microphone and 3.5mm cable experiments, it was not meaningful to combine them into a theoretical experiment. Rather, it was more noteworthy to create a 9-listener experiment, as described next.

5.4.3 9-Listener Combination

The final theoretical experiment combined the Distributed Listening results from all of the 3-listeners experiments to simulate a 9-listener environment. The sum of the local bigram frequency counts was used to determine the most likely spoken phrase in this experiment, and the local frequency counts were the same as those calculated during the individual 3-listeners experiments. In the instance where a round from a 3-listeners setup was inconclusive, only the conclusive results from that round were used. The results from this experiment are listed in table 31. Columns Stereo DL, Speaker DL, and Cable DL represent the most likely spoken phrase results of the stereo mix, external speaker/microphone, and 3.5mm cable experiments, respectively.

                         Stereo DL   Speaker DL   Cable DL   Distributed Listening
Overall                  48%         7%           9%         50%
Overall Interpretation   52%         7%           9%         52%
Valid                    54%         4%           7%         57%
Valid Interpretation     57%         4%           7%         57%
Invalid                  43%         11%          11%        43%
Invalid Interpretation   46%         11%          11%        46%

Table 31 Stereo/Speaker/Cable Combination Recognition Accuracy Rates

Like the 6-listener stereo/cable experiment, the 9-listener test resulted in an increase in the overall and valid categories compared to the stereo mix experiment for Distributed Listening, and this is likewise due to the first round of recognitions, as shown in table 32. Within the overall category, the accuracy of Distributed Listening at 50% is better than the accuracy from the stereo mix experiment at 48%. Within the valid category, Distributed Listening exceeds the accuracy of the stereo mix experiment by 3%.

Recognizer                         Recognition
Listener 1 - Satellite (Stereo)    it's a disease that affects how you move
Listener 2 - Dell (Stereo)         is a disease that affects how you move
Listener 3 - Tablet (Stereo)       it is a disease that affects how you move
Listener 4 - Satellite (Speaker)   the thing that affects how you
Listener 5 - Dell (Speaker)        it is believed that affect how you move
Listener 6 - Tablet (Speaker)      this disease that affects how you move
Listener 7 - Satellite (Cable)     it's a disease that affects how you move
Listener 8 - Dell (Cable)          it's a disease that affects how you move
Listener 9 - Tablet (Cable)        it's a disease that affects how you move
Distributed Listening              it's a disease that affects how you move
Actual Spoken Phrase               it's a disease that affects how you move

Table 32 Stereo/Speaker/Cable Combination Round 1

The remaining categories stayed consistent with the stereo mix experiment in that Distributed Listening matched the accuracy of the stereo mix experiment for those categories. The average WER also remained consistent with the stereo mix experiment at 0.26, as shown in table 33.

Phrase Number   Stereo DL   Speaker DL     Cable DL   Distributed Listening
1               0.25        0.63           0.00       0.00
2               0.00        2.08           0.75       0.00
3               0.00        2.75           1.38       0.00
4               0.00        0.00           0.00       0.00
5               0.00        1.75           0.50       0.00
6               0.00        0.33           0.33       0.00
7               0.09        1.91           1.00       0.09
8               0.63        0.83           1.29       0.83
9               0.33        0.78           1.22       0.33
10              0.00        0.00           0.50       0.00
11              0.00        0.83           0.00       0.00
12              0.22        1.44           1.22       0.22
13              0.44        3.00           3.44       0.44
14              0.54        0.92           2.62       0.54
15              0.00        0.46           0.31       0.00
16              0.40        0.80           0.80       0.40
17              0.00        2.33           0.71       0.00
18              0.20        1.05           0.50       0.20
19              0.00        0.00           0.40       0.00
20              0.08        1.96           0.75       0.08
21              0.40        3.00           1.00       0.40
22              0.00        0.44           0.00       0.00
23              0.00        1.50           0.30       0.00
24              0.00        0.00           5.13       0.00
25              0.56        0.78           5.78       0.56
26              0.00        0.39           3.06       0.00
27              0.75        1.75           0.83       0.83
28              0.00        2.10           2.10       0.00
29              0.00        0.62           0.00       0.00
30              0.06        3.06           1.17       0.06
31              0.55        1.18           1.09       0.55
32              0.33        2.33           1.60       0.33
33              0.00        inconclusive   1.40       0.00
34              0.14        2.64           1.93       0.14
35              0.00        1.22           1.22       0.00
36              0.00        0.70           2.50       0.00
37              0.00        1.71           3.00       0.00
38              0.00        1.87           1.47       0.00
39              0.79        2.07           2.00       0.79
40              0.00        2.23           3.38       0.00
41              0.00        3.42           2.25       0.00
42              0.15        2.08           1.08       0.15
43              1.70        3.50           3.20       1.70
44              0.16        1.05           1.16       0.16
45              0.35        2.17           1.26       0.35
46              1.09        3.09           1.96       1.09
47              1.67        2.94           2.50       1.67
48              0.18        5.18           3.27       0.18
49              0.50        1.63           2.75       0.50
50              0.94        2.56           2.44       0.94
51              0.00        1.00           1.25       0.00
52              0.00        0.75           0.75       0.00
53              0.75        2.25           2.50       0.75
54              0.13        0.63           1.50       0.13
55              0.00        1.56           1.44       0.00
56              0.00        0.57           0.57       0.00
Average         0.26        1.60           1.55       0.26

Table 33 Stereo/Speaker/Cable Combination Word Error Rates

The preceding experiments individually produced optimal results, and a collective, comparative analysis as set forth in the following section will further display the effectiveness of the Distributed Listening system.

5.5 Discussion

Each of the three physical experiments and three theoretical experiments produced excellent results. Distributed Listening as a composite system consistently met or outperformed the results from the individual listeners. Within the overall category, for all six experiments, Distributed Listening never performed worse than the best individual listener and, in fact, exceeded the accuracy of the best individual listener in the 3.5mm cable experiment and exceeded the baseline metric in the stereo/cable and stereo/speaker/cable experiments. Distributed Listening exceeded the best individual listener within the overall interpretation category for both the stereo mix and 3.5mm cable experiments and matched the best individual listener and the baseline metric for the remaining experiments.

Within the valid category, Distributed Listening exceeded the best individual listener for the stereo mix experiment and exceeded the baseline metric for both the stereo/cable and stereo/speaker/cable experiments. The remaining experiments showed that Distributed Listening was no worse than the best reported accuracy.
The valid interpretation category for the six experiments had the most inconsistent results. While Distributed Listening exceeded the best individual listener of the stereo mix experiment, the accuracy of Distributed Listening for the 3.5mm cable experiment was worse than the best individual listener. This is surprising since it is expected that adding in semantic equivalents will at a minimum not decrease accuracy, and at best, improve accuracy. That was the case for the stereo mix experiment, but not the 3.5mm cable experiment. As discussed in section 5.3 and shown in table 24, this unexpected result is attributed to the agreement between listeners on misrecognized bigrams.

The invalid category across all six experiments presented interesting and unexpected results from Distributed Listening. In five of the six experiments, Distributed Listening returned results that were only as good as the best individual listener. The remaining experiment (stereo mix) showed that Distributed Listening was worse than the best individual listener. Distributed Listening did not exceed the best individual listener during any of the tests for the invalid category. The same can be said for the invalid interpretation category. Not only did Distributed Listening not exceed an individual listener, Distributed Listening was worse than the best during the stereo mix experiment. This can be attributed to the contents of the corpus. Recall that the results in the invalid category are the ensuing recognitions from the actual spoken phrases that do not have bigrams that are maintained in the corpus. Distributed Listening builds the most likely spoken phrase by combining and concatenating bigrams. If those bigrams are not in the corpus or the original evidence, the phrases that contain those words do not pass validation. Using the corpus as both a validation tool and as a mechanism to break ties requires that the corpus contain words relevant to the domain in order to have optimal results.
With regard to the WER calculations, Distributed Listening again reported excellent results. For each of the three physical experiments, Distributed Listening had the lowest average WER. For each of the three theoretical experiments, Distributed Listening was as good as the baseline metric.

When the results of Distributed Listening are taken as a whole, including the poor results from the invalid and invalid interpretation categories, Distributed Listening completely conforms to the theoretical expectations that were established at the start of the research task. The resulting data analysis from the experiments fully supports the hypothesis and establishes that Distributed Listening is a practical composite architecture of distributed ASR systems.

6 Conclusion

Distributed Listening is a novel approach to improving the accuracy of spoken language systems by using a distributed approach through the use of multiple, yet independent, recognizers with distinct and separate input sources. Through the exhaustive experimentation phase, it was demonstrated that Distributed Listening is indeed a viable solution that complements existing technologies and therefore adds significant contributions to the area of spoken language systems, as described in the next section.

6.1 Contributions

The Distributed Listening research project first addressed the details of the input source of the spoken utterances, one of the common implementation details of systems that use multiple speech recognizers. Previous research activities used one input source that split the speech signal across the multiple speech recognizers. Distributed Listening provides each recognizer with its own independent and physically separate input source, to address this limitation. An additional component of Distributed Listening that adds to the field is the inherent ability to simulate the process of human hearing and deduction.
Recall from previous discussions that Distributed Listening is analogous to the deductive reasoning of detectives in combination with the psychological process of dichotic listening. Further developments that mimic the way people process speech will add to the technological advancements of spoken language systems. Lastly, Distributed Listening addresses the need for speaker-dependent systems. While systems that are trained will always produce, on average, better accuracy results than untrained, speaker-independent systems, this research showed that reasonable recognition rates are possible when systems have not been trained for specific individuals.

6.2 Directions for Future Research

There are several areas within Distributed Listening that can benefit from further investigation. As the research progressed, several promising ideas were developed that warrant exploration. This section will explain those ideas.

6.2.1 Architectures

There is reason to look at alternative architecture models for Distributed Listening. Three models came forth during this project. The first is the homogeneous model, which uses the same grammar or language model for each listener. Although all of the listeners are identical in capturing the input, this architecture allows for the different perspectives of the utterances to also be captured. The research presented here follows this model, yet there is a need to compare this model with the other two models.

The second proposed architecture is the heterogeneous model. Within the heterogeneous model, each listener uses a different grammar or language model. Each listener will keep its own input source and produce a recognition result. This model implies a distributed grammar/language model and allows for flexibility, as very large grammars and vocabularies can be distributed across several listeners.
The final proposed architecture is the hybrid model, which contains a homogeneous architecture of heterogeneous distributed listening nodes. The hybrid architecture, as shown in Figure 35, gives the embedded environment the ability to recognize multiple languages, as well as accommodate translations of inter-mixed spoken language.

Figure 35 Hybrid Distributed Listening Architecture

The heterogeneous and hybrid models each add another dimension to the Distributed Listening system, and a comparative analysis of the three models should show continued recognition accuracy improvements with ASR systems that use multiple recognizers. In addition to the proposed models, a parallel listening architecture warrants further discussion, as described next.

6.2.2 Parallel Processing

The motivation for this research was to simulate the way in which people hear and process speech with multiple independent listeners. An applicable adaptation of that motivation would be to provide an alternative solution for students with hearing impairments in the post-secondary classroom. Within the classroom, the increase in speech recognition accuracy will make communication between hearing-impaired students and instructors more effective by improving the automatic transcription of the spoken lecture.

Currently, students with hearing impairments face a number of educational obstacles that can be exacerbated when the transcription results of a lecture are less than optimal. For instance, hearing-impaired students have a learning curve that stems from the translation of the sentence structure of written English words to the established syntax of American Sign Language. Unfortunately for these students, American Sign Language does not follow the grammar rules and sentence structures of the English language (Liddell 1980). Additionally, inaccuracies in the transcription result can directly impact the educational future of the student.
Wrong or missing information is time-consuming in that the student would spend significant time re-learning information. Post-secondary education for hearing-impaired students faces different challenges than secondary education, as it is increasingly difficult to keep up with language that is usually more technical and rapid, which results in more note taking. For students who have been able to cope by relying on lipreading, it is now impossible to lipread and take notes simultaneously (Reed 1984). Distributed Listening has the potential to directly affect the success of hearing-impaired students in an educational environment.

To effectively evaluate Distributed Listening in that environment, a real-time parallel processing scheme is needed. The results presented in this project stemmed from parallel listening implemented through the programming language. Distributed Listening would benefit further from an extension of this work that processes the interpretations in a real-time, parallel-processing listening environment.

6.3 Summary

Distributed Listening was created with the hypothesis that the system would perform, at worst, as well as the best individual listener. The overall interpretation and valid interpretation categories showed the best results for Distributed Listening, which is in keeping with the underlying theme of this research.

The motivation of this research was to mimic the way in which humans process speech. When several people are privy to the same spoken input and you rely on those people to repeat what was heard, the versions that are relayed frequently contain interpretations of what was actually said and are prone to inherent errors. The results are only as accurate as the people relaying the information. Adding the computer component eliminates the inconsistencies inherent to people, thereby increasing the effectiveness of the results.
The overall interpretation and valid interpretation accuracy results reconciled to the correct meaning of the actual spoken phrases. When the correct meaning is accurately conveyed, this mimics the behavior of people.

Once the experiment was complete, the data analysis showed that Distributed Listening performs better with regard to speech recognition than an individual ASR system. Generally, the recognition rates of Distributed Listening met or outperformed the recognition rates of the individual standard recognition systems. The experiments supported the premise and hypothesis of Distributed Listening and confirm that Distributed Listening is a viable alternative to existing methods of ASR systems that use multiple recognizers.

References

1. Barry, T., Solz, T., Reising, J. and Williamson, D., The simultaneous use of three machine speech recognition systems to increase recognition accuracy, In Proceedings of the IEEE 1994 National Aerospace and Electronics Conference, vol. 2, pp. 667-671, 1994.
2. Baum, L.E., An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes, Inequalities 3, pp. 1-8, 1972.
3. Bruder, G.E., Stewart, J.W., McGrath, P.J., Deliyannides, D. and Quitkin, F.M., Dichotic listening tests of functional brain asymmetry predict response to fluoxetine in depressed women and men, Neuropsychopharmacology, 29(9), pp. 1752-1761, 2004.
4. Brutti, A., Coletti, P., Cristoforetti, L., Geutner, P., Giacomini, A., Gretter, R., Maistrello, M., Matassoni, M., Omologo, M., Steffens, F. and Svaizer, P., Use of Multiple Speech Recognition Units in an In-car Assistance System, chapter in "DSP for Vehicle and Mobile Systems", Kluwer Publishers, 2004.
5. Cristoforetti, L., Matassoni, M., Omologo, M. and Svaizer, P., Use of parallel recognizers for robust in-car speech interaction, In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong, 2003.
6. Deng, L. and Huang, X., Challenges in adopting speech recognition, Communications of the ACM, vol. 47, no. 1, pp. 69-75, January 2004.
7. Duchnowski, P., A New Structure for Automatic Speech Recognition, Diss., Massachusetts Institute of Technology, 1993.
8. Durrant, J. and Lovrinic, J., Bases of Hearing Science, Williams & Wilkins, 1995.
9. Evermann, G. and Woodland, P., Posterior Probability Decoding, Confidence Estimation and System Combination, In Proceedings of NIST Speech Transcription Workshop, 2000.
10. Fiscus, J.G., A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 347-354, 1997.
11. Furui, S., Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, Inc., 1989. Furui, S., Recent progress in spontaneous speech recognition and understanding, In Proceedings of the IEEE Workshop on Multimedia Signal Processing, 2002.
12. Gilbert, J.E. and Zhong, Y., Speech User Interfaces for Information Retrieval, In Proceedings of 12th Annual ACM Conference on Information & Knowledge Management, New Orleans, Louisiana, pp. 77-82, 2003.
13. Goel, V., Kumar, S. and Byrne, W., Segmental Minimum Bayes-Risk ASR Voting Strategies, In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP), Beijing, China, pp. 139-142, 2000.
14. Ide, N. and Suderman, K., The Open American National Corpus (OANC), [Online]. Available: http://www.AmericanNationalCorpus.org/OANC, 2007.
15. Jurafsky, D. and Martin, J., Speech and Language Processing, Prentice Hall, 2000.
16. Kodama, Y., Utsuro, T., Nishizaki, H. and Nakagawa, S., Experimental Evaluation on Confidence of Agreement among Multiple Japanese LVCSR Models, In Proceedings of the 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, pp. 2549-2552, 2001.
17. Liddell, S., American Sign Language Syntax, Mouton Publishers, 1980.
18. McCowan, I., Moore, D., Dines, J., Gatica-Perez, D., Flynn, M., Wellner, P. and Bourlard, H., On the Use of Information Retrieval Measures for Speech Recognition Evaluation, Idiap Publications, 2005.
19. MP3Gain, [Online]. Available: http://mp3gain.sourceforge.net/, 2010.
20. Natural Language Software Registry, [Online]. Available: http://registry.dfki.de/, 2004.
21. Pierce, A., Acoustics, McGraw-Hill, 1981.
22. Price, P., The Growing Impact of Speech Technology on Society, In AAAS Session Presentation on Language Processing for Science and Society, San Diego, CA, 2010.
23. Reed, M., Educating Hearing-Impaired Children, Open University Press, 1984.
24. Schwenk, H. and Gauvain, J., Combining Multiple Speech Recognizers using Voting and Language Model Information, In Proceedings of the IEEE International Conference on Speech and Language Processing (ICSLP), Beijing, pp. II:915-918, 2000.
25. Solsona, R.A., Fosler-Lussier, E., Kuo, H., Potamianos, A. and Zitouni, I., Adaptive Language Models for Spoken Dialogue Systems, In Proceedings of the ICASSP, 2002.
26. Young, S.R., Hauptmann, A.G., Ward, W.H., Smith, E.T. and Werner, P., High level knowledge sources in usable speech recognition systems, Communications of the ACM, vol. 32, no. 2, pp. 183-194, 1989.
27. Young, S.R., Use of dialog, pragmatics and semantics to enhance speech recognition, Speech Communication, vol. 9, pp. 551-564, 1990.

Appendix A: Recognition Engines Results - Stereo Mix

Shading indicates a valid actual spoken phrase, meaning each bigram of the phrase is found within the corpus.
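The validity criterion stated above (every bigram of the phrase occurs in the corpus) can be illustrated with a short Python sketch. The function names and the two-sentence toy corpus are hypothetical and are not taken from the experiments. The sketch also assumes lowercase, whitespace-delimited words, which matches how the phrases are printed in this appendix but may differ from the tokenization actually used.

```python
from typing import Iterable, List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(words: List[str]) -> List[Bigram]:
    """Return the adjacent word pairs of a tokenized phrase."""
    return list(zip(words, words[1:]))


def corpus_bigrams(sentences: Iterable[str]) -> Set[Bigram]:
    """Collect every bigram that occurs in the corpus sentences."""
    seen: Set[Bigram] = set()
    for sentence in sentences:
        seen.update(bigrams(sentence.lower().split()))
    return seen


def is_valid_phrase(phrase: str, known: Set[Bigram]) -> bool:
    """A phrase is valid when each of its bigrams is found in the corpus.

    Assumption: a one-word phrase has no bigrams and is treated as valid.
    """
    return all(pair in known for pair in bigrams(phrase.lower().split()))


# Toy corpus for illustration only; the research used a much larger corpus.
toy_corpus = ["it is a disease", "a disease that affects how you move"]
known = corpus_bigrams(toy_corpus)
valid = is_valid_phrase("it is a disease that affects how you move", known)
invalid = is_valid_phrase("it is a finger", known)  # ("a", "finger") is unseen
```

Under these assumptions, a phrase can be valid even when it never appears in the corpus as a whole sentence, because only adjacent word pairs are checked.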
Listener 1 (Satellite) Listener 2 (Dell) Listener 3 (Tablet) Distributed Listening Result Actual Spoken Phrase 1. it's a disease that affects how you move is a disease that affects how you move it is a disease that affects how you move it is a disease that affects how you move it is a disease that affects how you move 2. a finger that were goals and you can't really get it to stop a finger that wiggles and you can't really get it to stop a finger wiggles and you can't really get it to stop a finger that wiggles and you can't really get it to stop a finger that wiggles and you can't really get it to stop 3. an arm doesn't swing quite the same but an arm doesn't swing quite the same way an arm doesn't swing quite the same way an arm doesn't swing quite the same way an arm doesn't swing quite the same way 4. your handwriting has changed your handwriting has changed your handwriting has changed your handwriting has changed your handwriting has changed 5. is well in motion before any of these symptoms first become clear is well in motion before any of these symptoms first become clear is well in motion before any of these symptoms first become clear is well in motion before any of these symptoms first become clear is well in motion before any of these symptoms first become clear 6. a lot has happened before you first notice a lot has happened before you first notice it a lot has happened before you first notice it a lot has happened before you first notice it a lot has happened before you first notice it 104 7. 
the first thing i remember noticing process on closing tingling sensation in my left leg and to some extent my left arm the first thing i remember noticing was this on causing tingling sensation in my left leg and to some extent my left arm the first thing i remember noticing was this on buzzing tingling sensation in my left leg and to some extent my left arm the first thing i remember noticing was this on buzzing tingling sensation in my left leg and to some extent my left arm the first thing i remember noticing was this odd buzzing tingling sensation in my left leg and to some extent my left arm 8. used to feel that my cell phone was vibrating a reach for a 10 minute find that there was nothing there are used to feel that my cell phone is vibrating i'd reach for the men i'd find that there was nothing there are used to feel that my cell phone was vibrating i'd reach for the and i'd find that there was nothing there are used to feel that my cell phone was vibrating i'd reach for the and i'd find that there was nothing there i used to feel that my cell phone was vibrating and i'd reach for it and then i'd find that there was nothing there 9. i noticed that i didn't think that my arm was swinging quite the same way when i draw i noticed that i didn't think that my arm was swinging quite the same way when i draw i noticed that i didn't think that my arm was swinging quite the same way when i draw i noticed that i didn't think that my arm was swinging quite the same way when i draw i noticed that i didn't think that my arm was swinging quite the same way when i jogged 10. the only reason i really did was because of my family history the only reason i really did was because of my family history the only reason i really did was because of my family history the only reason i really did was because of my family history the only reason i really did was because of my family history 11. 
and that was the prevailing wisdom for a very very long time and that was the prevailing wisdom for a very very long time and that was the prevailing wisdom for a very very long time and that was the prevailing wisdom for a very very long time and that was the prevailing wisdom for a very very long time 12. to stop the disease before it even starts and you stop the disease before it even starts and you stop the disease before it even starts and you stop the disease before it even starts and you'd stop the disease before it even starts 105 13. with stem cell research figure into the genetic connection with stem cell research figure into the genetic connection with stem cell research figure into the genetic connection with stem cell research figure into the genetic connection would stem cell research figure into the genetic connection 14. but the more we learn about the disease more complicated it becomes to the more we learn about the disease more complicated it becomes the more we learn about the disease more complicated it becomes to the more we learn about the disease more complicated it becomes but the more we learn about the disease the more complicated it becomes 15. in lots of ways it made the research task that much more complex in lots of ways it made the research task that much more complex in lots of ways it made the research task that much more complex in lots of ways it made the research task that much more complex in lots of ways it made the research task that much more complex 16. and we also saw the number of researchers left and we also saw the number for searchers left and we also saw the number of researchers left and we also saw the number of researchers left and we also saw that a number of researchers left 17. 
they went to other places that were more open to stem cell research whether that was in europe or in singapore they went to other places that were more open to some sort research whether that was in europe or in singapore they went to other places that were more open to stem cell research whether that was in europe or in singapore they went to other places that were more open to stem cell research whether that was in europe or in singapore they went to other places that were more open to stem cell research whether that was in europe or in singapore 18. but i think it's pretty clear that it had a negative effect on the way in which the fuel progressed but i think it's pretty clear that it had a negative effect on the way in which the feel progress but i think it's pretty clear that it had a negative effect on the way in which the feel progress but i think it's pretty clear that it had a negative effect on the way in which the feel progress but i think it's pretty clear that it had a negative effect on the way in which the field progressed 106 19. i also understand the moral objections that some people have i also understand the moral objections that some people have i also understand the moral objections that some people have i also understand the moral objections that some people have i also understand the moral objections that some people have 20. for a lot of people it's a dilemma because your faith might be telling you one thing in your body is telling you another for a lot of people it's a dilemma because your faith might be telling you one thing in your body is telling you another for a lot of people it's a dilemma because your faith might be telling you one thing in your body is telling you another for a lot of people it's a dilemma because your faith might be telling you one thing in your body is telling you another for a lot of people it's a dilemma because your faith might be telling you one thing and your body is telling you another 21. 
but he was young and this happened seemingly overnight but he was young and this happened seemingly overnight but he was young and this happened seemingly overnight but he was young and this happened seemingly overnight but he was young and this had happened seemingly overnight 22. and it does appear that there is a relationship and it does appear that there is a relationship and it does appear that there is a relationship and it does appear that there is a relationship and it does appear that there is a relationship 23. genetics load the gun and the environment pulls the trigger genetics load the gun and the environment pulls the trigger genetics load the gun and the environment pulls the trigger genetics load the gun and the environment pulls the trigger genetics load the gun and the environment pulls the trigger 24. there are medications that make a real difference there are medications that make a real difference there are medications that make a real difference there are medications that make a real difference there are medications that make a real difference 25. that helps with some problems are not with others that helps with some problems are not with others that helps with some problems are not with others that helps with some problems are not with others it helps with some problems and not with others 107 26. i guess i choose to believe that i'll be able to do this for a very long time i guess i choose to believe that i'll be able to do this for a very long time i guess i choose to believe that i'll be able to do this for a very long time i guess i choose to believe that i'll be able to do this for a very long time i guess i choose to believe that i'll be able to do this for a very long time 27. 
but don't necessarily need or want to know more than but i don't necessarily need or want to know more than ago necessarily need or want to know more than but i don't necessarily need or want to know more than but they don't necessarily need or want to know more than that 28. he had been in declining health for the past year dealing with the long term effects of the stroke he suffered he had been in declining health for the past year dealing with the long term effects of the stroke he suffered he had been in declining health for the past year dealing with the long term effects of the stroke he suffered he had been in declining health for the past year dealing with the long term effects of the stroke he suffered he had been in declining health for the past year dealing with the long term effects of the stroke he suffered 29. and now it's nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language feature 30. the class as a semi improvised look inside a high school in a diverse working class paris neighborhood the class as a semi improvised look inside a high school in a diverse working class paris neighborhood the class as a semi improvised look inside a high school in a diverse working class paris neighborhood the class as a semi improvised look inside a high school in a diverse working class paris neighborhood the class is a semi improvised look inside a high school in a diverse working class paris neighborhood 31. 
and they all take a deep breath and entered the arena and they all take a deep breath and entered the arena and they all take a deep breath and entered the arena and they all take a deep breath and entered the arena then they all take a deep breath and enter the arena 108 32. he writes on the blackboard and a student makes them stop and define the word he writes on the blackboard and a student makes them stop and define the word he writes on the blackboard and a student makes them stop and define the word he writes on the blackboard and a student makes them stop and define the word he writes on the blackboard and a student makes him stop and define a word 33. is so the teacher has to set aside his plan to save first what would be wrong with that and they know it isn't true and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true 34. by the lesson has been derailed and there were snickers all around and by then the lesson has been derailed and there were snickers all around and by then the lesson has been derailed and there were snickers all around and by then the lesson has been derailed and there were snickers all around and by then the lesson has been derailed and there are snickers all around 35. these kids don't deserve to be educated basic these kids don't deserve to be educated they say these kids don't deserve to be educated these to these kids don't deserve to be educated they say these kids don't deserve to be educated they say 36. 
let them rot in their dead end world class jobs let them rot in their dead end low class jobs let them rot in their dead end low class jobs let them rot in their dead end low class jobs let them rot in their dead end low class jobs 37. how long can he maintain his equilibrium how long can he maintain his equilibrium how long can he maintain his equilibrium how long can he maintain his equilibrium how long can he maintain his equilibrium 38. he tells his colleagues that it's the job of the teacher to bring kids out he tells his colleagues that it's the job of the teacher to bring kids out he tells his colleagues that's the job of the teacher to bring kids out he tells his colleagues that it's the job of the teacher to bring kids out he tells his colleagues that it's the job of the teacher to bring kids out 109 39. what rouses most of the students is an assignment to write self portraits life rouses most of the students is an assignment to write self portraits what rouses most of the students is an assignment to write self portraits what rouses most of the students is an assignment to write self portraits what finally rouses most of his students is an assignment to write self portraits 40. for a brief spell they see him younger more open and ready to wear for a brief spell they seem younger more open and ready to learn for a brief spell they seem younger more open and ready to learn for a brief spell they seem younger more open and ready to learn for a brief spell they seem younger more open and ready to learn 41. but just when you're getting a warm utopian feeling something bad happens but just when you're getting a warm utopian feeling something bad happens but just when you're getting a warm utopian feeling something bad happens but just when you're getting a warm utopian feeling something bad happens but just when you're getting a warm utopian feeling something bad happens 42. 
i'll save the hero of the movie threatens to become its bad guy i'll save a hero of the movie threatens to become its bad guy i'll save the hero of the movie threatens to become its bad guy i'll save the hero of the movie threatens to become its bad guy i'll say the hero of the movie threatens to become its bad guy 43. that's what he does with real time classroom scenes are so strong that that's what he does with real time classroom scenes are so startling that's what he does with real time classroom scenes are so start what that's what he does with real time classroom scenes are so start what that's why those real time classroom scenes are so startling 44. they show you that at least until the system can be changed but battles will be moment to moment they show you that at least until the system can be changed the battles will be moment to moment they show you that at least until the system can be changed but battles will be moment to moment they show you that at least until the system can be changed but battles will be moment to moment they show you that at least until the system can be changed the battles will be moment to moment 110 45. republicans disagreed complaining that the bill's tax cuts fall short and that spends too much on things they say will create jobs republicans disagreed complaining that the bill's tax cuts fall short and that spends too much on things they say will create jobs republicans disagreed complaining that the bill's tax cuts fall short and that it spends too much on things they say will create jobs republicans disagreed complaining that the bill's tax cuts fall short and that spends too much on things they say will create jobs republicans disagree complaining that the bill's tax cuts fall short and that it spends too much on things they say won't create jobs 46. 
shows both product lines fell more than 20 percent last year as fewer people but tories leading up to the holiday season skills in both product lines fell more than 20 percent last year as fewer people bought toys leading up to the holiday season seals the both product lines fell more than 20 percent last year as fewer people but tories leading up to the holiday season seals the both product lines fell more than 20 percent last year as fewer people but tories leading up to the holiday season sales of both product lines fell more than twenty percent last year as fewer people bought toys leading up to the holiday season 47. toymaker v los angeles said in november it would shut about a thousand jobs toy maker based in los angeles said in november it would shut about a thousand jobs toymaker v los angeles said in november it would shut about a thousand jobs toymaker v los angeles said in november it would shut about a thousand jobs the toy maker based here in los angeles said in november it would shed about a thousand jobs 48. in that it wasn't immune from last year's deteriorating economic environment admitted wasn't immune from last year's deteriorating economic environment embedded wasn't immune from last year's deteriorating economic environment in that it wasn't immune from last year's deteriorating economic environment and that it wasn't immune from last year's deteriorating economic environment 49. spread to cut taxes that is the question spirit and cut taxes that is the question sprint or cut taxes that is the question spread to cut taxes that is the question spend or cut taxes that is the question 111 50. 
slips into berating down and just about every sector and every zip code thinks lip synch derailing down and just about every sector and every zip code slips into berating down and just about every sector and every zip code slips into berating down and just about every sector and every zip code pink slips seem to be raining down in just about every sector and every zip code 51. but are there any bright spots out there but are there any bright spots out there but are there any bright spots up in but are there any bright spots out there but are there any bright spots out there 52. the temporary job market may provide an answer the temporary job market may provide an answer the temporary job market may provide an answer the temporary job market may provide an answer the temporary job market may provide an answer 53. to become increasingly physical to become increasingly difficult to become increasingly physical to become increasingly physical it's become increasingly physical 54. the authorities have been slow to address at authorities have been slow to address at authorities have been slow to address the authorities have been slow to address at the authorities have been slow to address it 55. an egyptian court convicted a man of sexual harassment an egyptian court convicted a man of sexual harassment an egyptian court convicted a man of sexual harassment an egyptian court convicted a man of sexual harassment an egyptian court convicted a man of sexual harassment 56. women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful now women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful now women's rights advocates are cautiously hopeful now

Appendix B: Recognition Engines Results - External Speaker/Microphone

Shading indicates a valid actual spoken phrase, meaning each bigram of the phrase is found within the corpus.
Listener 1 (Satellite) Listener 2 (Dell) Listener 3 (Tablet) Distributed Listening Result Actual Spoken Phrase
1. the thing that affects how you it is believed that affect how you move this disease that affects how you move this disease that affects how you move it's a disease that affects how you move
2. it is illegal to give up a dearth of legal and you can't really do dot over the legal can't really get it a dearth of legal and you can't really do dot a finger that wiggles and you can't really get it to stop
3. on the going quite a form that is going quite in the an armed point 5 a form that is going quite in the an arm doesn't swing quite the same way
4. your handwriting and your handwriting has changed your handwriting has changed your handwriting has changed your handwriting has changed
5. well in motion with his and a rolling motion before any of the will become clear no one motion before any clear a rolling motion before any of the will become clear is well in motion before any of these symptoms first become clear
6. a lot has we are a lot has happened before you first notice a lot has before you first note a lot has happened before you first notice a lot has happened before you first notice it
7. the thing i remember is a leg the first thing i remember noticing what was on london england vision in my left leg of my left arm the first thing i remember noticing what one thing dangling from fiction in my left leg in a pilot are the first thing i remember noticing what one thing dangling from fiction in my left leg in a pilot are the first thing i remember noticing was this odd buzzing tingling sensation in my left leg and to some extent my left arm
8. i used to feel that myself and a i used to feel that my cell phone is vibrating a week or another i'm not there i used to feel that my cell phone is vibrating a reach for a do not find that there was nothing there i used to feel that my cell phone is vibrating a reach for a do not find that there was nothing there i used to feel that my cell phone was vibrating and i'd reach for it and then i'd find that there was nothing there
9. i noticed that my mom will i noticed that i didn't think that my arm was in quite the same way when i draw i noticed that i didn't think that my mom in quite the way i draw i noticed that i didn't think that my mom was in quite the same way when i draw i noticed that i didn't think that my arm was swinging quite the same way when i jogged
10. only reason i'm the only reason i really did was because of my family history the only reason i really didn't think about my family history the only reason i really did was because of my family history the only reason i really did was because of my family history
11. not with them is and that was the prevailing wisdom works very very well and that was the prevailing wisdom for a very very and that was the prevailing wisdom for a very very and that was the prevailing wisdom for a very very long time
12. ethos of the for the can you spot the disease before they can start and you thought the defeat before the start can you spot the disease before they can start and you'd stop the disease before it even starts
13. and so it is is a well researched figure in the genetic with stem cell research figure in a way with stem cell research figure in a way would stem cell research figure into the genetic connection
14. more we learn about the more we learn about the disease more complicated become the more we learn about the need for complicated the more we learn about the disease more complicated become but the more we learn about the disease the more complicated it becomes
15. lots of ways we are in lots of ways and means the research task that much more complex in lots of ways to leave the research task that much more in lots of ways and means the research task that much more complex in lots of ways it made the research task that much more complex
16. the office of and we also saw the number of lot and we also saw the number of researchers are and we also saw the number of researchers are and we also saw that a number of researchers left
17. little is known as they went to the place of the more open about what is in europe or in they went to the place of a more open system without we in europe will report they went to the place of a more open system without we in europe will report they went to other places that were more open to stem cell research whether that was in europe or in singapore
18. i think it's pretty neat that it had a negative effect on the way in which but i think it's pretty clear that it had a negative impact on the way in which but i think it's pretty clear that it had a negative effect on the way in which but i think it's pretty clear that it had a negative effect on the way in which but i think it's pretty clear that it had a negative effect on the way in which the field progressed
19. i also understand i also understand the moral objections that some people have i also understand the moral objection to come people have i also understand the moral objections that some people have i also understand the moral objections that some people have
20. a lot of people he is not a lot of people it was the one because and let me tell you one thing in your body when you are a lot of people think of when you think might be going with you and your body will not a lot of people it was the one because and let me tell you one thing in your body when you are for a lot of people it's a dilemma because your faith might be telling you one thing and your body is telling you another
21. yeah and if it happened overnight jan and i are not at the end of happen overnight yeah and if it happened overnight but he was young and this had happened seemingly overnight
22. it does appear as if there is a relationship it does appear that there is a relationship it does appear that there is a relationship it does appear that there is a relationship and it does appear that there is a relationship
23. you know the guys in the environmental trigger genetic flow the gun and the environment kinetic flow the gun and the environmental trigger kinetic flow the gun and the environmental trigger genetics load the gun and the environment pulls the trigger
24. there are there are medications that make a real difference there are medications that make up the there are medications that make a real difference there are medications that make a real difference
25. some problems with others it helped him problems are not with others it helps that some problems are not with other it helps that some problems are not with other it helps with some problems and not with others
26. can you believe it is a very i guess i choose to believe that i'll be able to do this for very long i guess i choose to believe that i'll be able to do that for her i guess i choose to believe that i'll be able to do this for very long i guess i choose to believe that i'll be able to do this for a very long time
27. there is really no there surely need or want more without necessarily need or want to know what without necessarily need or want to know what but they don't necessarily need or want to know more than that
28. been in declining health here in the long term effects of given the declining health of the year dealing with the long term effect will he had been in declining health for the past year in the long term effects of stroke he in declining health of the year long with year in the long term effects of stroke he had been in declining health for the past year dealing with the long term effects of the stroke he suffered
29. nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language and that's nominated for an academy award at our language and now it's nominated for an academy award for best foreign language and now it's nominated for an academy award for best foreign language feature
30. if you improvise inside the high school and working class and the class and the improvised look inside the high school in the diverse working class parent that a classic family and provide for the entire high school were working class neighborhood the you and inside the high school and high class and the diverse working class parent that the class is a semi improvised look inside a high school in a diverse working class paris neighborhood
31. you think is the envy of the they all take a deep breath and interviewing and they all take a deep breath and interviewing and they all take a deep breath and interviewing then they all take a deep breath and enter the arena
32. month on the blackboard is often defined as work on the blackboard at make them stop and the following were from the blackboard at length album to follow the word work on the blackboard at make them stop and the following were he writes on the blackboard and a student makes him stop and define a word
33. though the feature set aside as he was the one that has been known as for the picture is satisfied when they first what would be wrong with that and then know if as for the future have to set aside when they are well with you will not admit to it inconclusive and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true
34. and by the less than an in your microphone why the left has been rao in your knickers all well by then the lesson of thin radio in your personal and by the has been an in your knickers all and by then the lesson has been derailed and there are snickers all around
35. if he is these kids don't deserve to be educated they fit if kids don't deserve to be addicted to if kids don't deserve to be educated to fit these kids don't deserve to be educated they say
36. let them log in her job let them live in their dead end low cost jobs let them wanted to get into a glass jaw let them live in their dead end low cost jobs let them rot in their dead end low class jobs
37. how long is how long can we maintain this week with how long can we maintain equilibrium how long can we maintain this week with how long can he maintain his equilibrium
38. currently jobless thing is that until his colleagues at the job of the future to clean kids out he told his colleagues that the job of going into until his colleagues at that the job of going to clean kids out he tells his colleagues that it's the job of the teacher to bring kids out
39. i've only what rouses most of the students an assignment for white silk what relative to the field of furniture like to what rouses most of the students an assignment for white silk what finally rouses most of his students is an assignment to write self portraits
40. dell at the end when a how to spell and younger or older and ready to tell you what to do and we were how to spell and younger or older and ready to for a brief spell they seem younger more open and ready to learn
41. getting a room with bath adjust when you're getting a warning to you about how to just when you will warn you data adjust when you getting a you to you about how to but just when you're getting a warm utopian feeling something bad happens
42. movie at home that are authentic hero of the movie threatens to become a backup david hill of the movie to become a bad guy david hill of the movie to become a bad guy i'll say the hero of the movie threatens to become its bad guy
43. one note of caution is what one does when you're taught in classrooms she was so startled i flew to new york on classroom she was so startled what one does when you're taught in classrooms she was so startled that's why those real time classroom scenes are so startling
44. ensure that you have been showing the battle zone mode they show you that if until the system can change the battle moving moment to moment they show you that until the fifth inning battle moving moment and will they show you that if until the system can change the battle moving moment to moment they show you that at least until the system can be changed the battles will be moment to moment
45. you can disagree complaining that the a short and and want something that will create jobs republicans disagree completely with the dole tax cut short and that to march on thing that it will create jobs republicans disagree completely with the dole tax cut short and that are too large on thing that it will reach our republicans disagree completely with the dole tax cut short and that are that on thing that it will create jobs our republicans disagree complaining that the bill's tax cuts fall short and that it spends too much on things they say won't create jobs
46. you people are going for 20 percent of your votes will mean little if super bowl product line still more than 20 percent of a church pew with people while to what he will google product line for more than 20 work here is what we will super bowl product line for more than 20 percent of a what we with people while to what he will sales of both product lines fell more than twenty percent last year as fewer people bought toys leading up to the holiday season
47. i think that we should have about a thousand euro employers are very filtered through november it would shut about it lawmaker victor one third of november in which about a thousand job lawmaker victor one third of november in which about a thousand job the toy maker based here in los angeles said in november it would shed about a thousand jobs
48. if you are not in charge of economic has everyone been somewhat deteriorating economic who would you want your picture when you are who would you want your picture when you are and that it wasn't immune from last year's deteriorating economic environment
49. but the fact that you sperry will cut taxes that you question very low cut back to that question sperry will cut taxes that you question spend or cut taxes that is the question
50. it's one thing to be written down and just about every sector and to the growth thanks lip synch derailing barometric but it works or at your throat it's one thing to berating her out of just about every sort are are are are it's one thing to be but it and just about every sort are are the are pink slips seem to be raining down in just about every sector and every zip code
51. the other day and bought the are there any bright spots off the are there any bright spots off the are there any bright spots off the but are there any bright spots out there
52. temporary job market may provide a the temporary job market may provide an effort temporary job market may provide better the temporary job market may provide an effort the temporary job market may provide an answer
53. unique vocal it's becoming increasingly difficult it's becoming increasingly difficult it's becoming increasingly difficult it's become increasingly physical
54. but i will let you the authority has been slow to address it authorities have been slow to address the authority has been slow to address it the authorities have been slow to address it
55. an egyptian court convicted a man shall an egyptian court convicted a man an egyptian court convicted a man of such an egyptian court convicted a man of such an egyptian court convicted a man of sexual harassment
56. we are advocates are cautiously hopeful women's rights advocates are cautiously hopeful that women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful that women's rights advocates are cautiously hopeful now

Appendix C: Recognition Engines Results - 3.5mm Auxiliary Cable

Shading indicates a valid actual spoken phrase, meaning each bigram of the phrase is found within the corpus.

Listener 1 (Satellite) Listener 2 (Dell) Listener 3 (Tablet) Distributed Listening Result Actual Spoken Phrase
1. it's a disease that affects how you move it's a disease that affects how you move it's a disease that affects how you move it's a disease that affects how you move it's a disease that affects how you move
2. for the winter encampment to stop a finger wiggle and you can't really get it full stop a finger wiggles and you can't really get it full stop a finger wiggles and you can't really get it full stop a finger that wiggles and you can't really get it to stop
3. on the swing quite the thing that unarmed doesn't swing quite the same an armed swing quite the same an armed swing quite the same an arm doesn't swing quite the same way
4. your handwriting has changed your handwriting has changed your handwriting has changed your handwriting has changed your handwriting has changed
5. if i should put this clear to if whirling motion up short in the ship's first clear it's well in motion before any of these symptoms will become clear it's well in motion before any of these symptoms will become clear is well in motion before any of these symptoms first become clear
6. a lot has happened before the first to a lot has or uke perched notes a lot has happened before you first notice a lot has happened before you first notice a lot has happened before you first notice it
7. the first thing i remember noticing what are you handling temptation in my left leg in our the first thing i remember notice you aren't posting links on station in my left leg and stomach stunt pilot are the first thing i remember noticing what i've been tingling sensation in my left leg and some of my left arm the first thing i remember noticing what i've been tingling sensation in my left leg and some of my left arm the first thing i remember noticing was this odd buzzing tingling sensation in my left leg and to some extent my left arm
8. i just feel that my cell phone is vibrating a reach for an ipod or not they are i used to feel that myself up to by birdie at reach for it cannot find it there but not there i used to feel that my cell phone with vibrating a reach for and i find nothing there i used to feel that my cell phone with vibrating a reach for and i find nothing there but not there i used to feel that my cell phone was vibrating and i'd reach for it and then i'd find that there was nothing there
9. i noticed that i didn't think that my arm was swinging our way to draw i noticed that i didn't think my arm what's going in quite the same way i sure i noticed that i didn't think my arm swinging quite the same way i draw i noticed that i didn't think my arm what's going in quite the same way i sure i noticed that i didn't think that my arm was swinging quite the same way when i jogged
10. the reason i really did was because of my family the only reason i rate it just the only reason i really do because of my family history the only reason i really do because of my family history the only reason i really did was because of my family history
11. if the prevailing wisdom for a very very and that was the prevailing list for surgery and that was the prevailing wisdom for a very very long time and that was the prevailing wisdom for a very very long time and that was the prevailing wisdom for a very very long time
12. to stop the disease before they can start can you stop the disease before they can start each stop the disease before the storm can you stop the disease before they can start and you'd stop the disease before it even starts
13. what the research figure into it you will with stem cell research figure in chief that you won't one of the research figure in you that you won't one of the research figure in you that you won't would stem cell research figure into the genetic connection
14. learn about the disease more complicated the more we learn about the new work the more we learn about the need for public the more we learn about the need for public but the more we learn about the disease the more complicated it becomes
15. in lots of ways in the research task that much more lots way they researched that much work in lots of ways to name the research task that much more complex in lots of ways to name the research task that much more complex in lots of ways it made the research task that much more complex
16. the office of a number of researchers was and we also saw the number for searchers walked and we also saw the number of researchers are and we also saw the number of researchers are and we also saw that a number of researchers left
17. memento of the plaintiff or defendant without we weren't in the they went other places that were more open to stem cell research without which in europe or in singapore they went to other places are more open systems research whether that was in europe when singapore they went other places that were more open to stem cell research without which in europe or in singapore they went to other places that were more open to stem cell research whether that was in europe or in singapore
18. but i forget what you said it had a negative effect on the way in which the but i think it's pretty clear that it had a negative impact on the way in which this but i think it's pretty clear that it had a negative effect on the way in which the filter graph but i think it's pretty clear that it had a negative effect on the way in which the filter graph but i think it's pretty clear that it had a negative effect on the way in which the field progressed
19. i also understand the moral objection to have the i also understand the moral objection to some people have i also understand the moral objections that some people have i also understand the moral objection to some people have i also understand the moral objections that some people have
20. a lot of people think of when you might be telling when your body will not for a lot of people it's a dilemma because you think might be telling one thing in your body telling you not a lot of people it's a dilemma because your faith might be telling them your body will for a lot of people it's a dilemma because you think might be telling one thing in your body telling you not for a lot of people it's a dilemma because your faith might be telling you one thing and your body is telling you another
21. mouth and if it happened seemingly overnight but he was not and if it happened seemingly overnight but he was not and if it happened overnight but he was not and if it happened seemingly overnight but he was young and this had happened seemingly overnight
22. and it does appear that there is a relationship and it does appear that there is a relationship it does appear that there is a relationship and it does appear that there is a relationship and it does appear that there is a relationship
23. you will become an environmental genetics load the guns and the environment pulled the trigger genetic load the gun and the environmental trigger genetics load the guns and the environment pulled the trigger genetics load the gun and the environment pulls the trigger
24. there are medications that make a real there are medications that make out there are education it helps that some problems are not with others there are medications that make a real difference
25. it helps the phone problems are not with other it helps with some auto show not without it helps that some problems are not with others i guess i choose to believe that i'll be able to do this for a very long it helps with some problems and not with others
26. perfectionist and i've enabled it to the i guess i choose to believe that i'll be able to do that starter i guess i choose to believe that i'll be able to do this for a very long but don't necessarily need or want to know more than i guess i choose to believe that i'll be able to do this for a very long time
27. assuming you don't want to know what to but don't necessarily need or want to know more than they don't necessarily need or want to know but don't necessarily need or want to know more than but they don't necessarily need or want to know more than that
28. you know the kind of health for the past year are the long term effects of the he had been in declining health for the past year and what the long term effects of stroke each you don't declining health of your dealing with the long term effects of the stroke he suffered you don't declining health of health for the past year are the long term effects of the stroke he had been in declining health for the past year dealing with the long term effects of the stroke he suffered
29. nominated for an academy award for best language and now it's nominated for an academy award for best foreign language feature and nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language feature and now it's nominated for an academy award for best foreign language feature
30. the class into semi improvised look inside the high school weren't working or are you the class and family and provide looking tight high school in the first working class parent the class is a semi improvised the inside high school in the diverse working class parents need to the class is a semi improvised the inside high school in the diverse working class parents need to the class is a semi improvised look inside a high school in a diverse working class paris neighborhood
31. take a deep breath and interviewing then they all take a deep breath and interviewing then we all take a deep breath and interview me then we all take a deep breath and interview me then they all take a deep breath and enter the arena
32. from the blackboard and a student once on the blackboard and a student makes stop in the photo were from the blackboard at the next stop in the following work once on the blackboard and a student makes stop in the photo were he writes on the blackboard and a student makes him stop and define a word
33. my favorite feature have to set aside when he wrote to the world that is true to at so the teacher has to set aside when he first one would be wrong with that and then go would have been true i thought the feature set aside when they first what would be wrong with that i know what you think at so the teacher has to set aside when he first one would be wrong with that and then go would have been true and so the teacher has to set aside his plan and say first what would be wrong with that and then no it isn't true
34. left within rao and it works well and by the lesson has been rao nurse who told well by then the lesson has been real in your sneakers although by then the lesson has been real in your sneakers although and by then the lesson has been derailed and there are snickers all around
35. teach kids to be educated to these kids don't serve to be educated faith these kids don't deserve to be defeated if it these kids don't deserve to be defeated if it these kids don't deserve to be educated they say
36. let them live in their quest job let them live in your skin will cluster out let them watch you get a world class job let them live in your skin will cluster out let them rot in their dead end low class jobs
37. how well do you think we how long can we maintain this how long can mean anything from the how long can mean anything from the how long can he maintain his equilibrium
38. don't you think the job of the future will have to he told his colleagues that the job of the future from being kicked out joseph cawley is the job of the feature to link it to he told his colleagues that the job of the future from being kicked out he tells his colleagues that it's the job of the teacher to bring kids out
39. rouses most of us know so we both houses most of the students an assignment like self quick row since most of the students an assignment to write self with row since most of the students an assignment to write self with what finally rouses most of his students is an assignment to write self portraits
40. put a spell on her way to the how to spell a younger and we were spell over more of the writing from the spell a spell on her way to the for a brief spell they seem younger more open and ready to learn
41. just to let you know if god just when you're getting warmed up fuel back but just when you're getting a warm welcome fuel from backup but just when you're getting a warm welcome fuel from backup but just when you're getting a warm utopian feeling something bad happens
42. see this movie to come back to the authentic hero of the movie threatens to become a bad guy if you love the movie threatens to become bad for authentic hero of the movie threatens to become a bad guy i'll say the hero of the movie threatens to become its bad guy
43. 20 posted on classroom to it's one of those you'll find quite so stark it's one of those with you on classroom or so start it's one of those with you on classroom or so start that's why those real time classroom scenes are so startling
44. show me the oktoberfest show about a moment in the they show you that until the system can be changed to battle and moment to moment to they show you the until the system can be changed to battle moment to moment they show you that until the system can be changed to battle and moment to moment to they show you that at least until the system can be changed the battles will be moment to moment
45. republican disagree completely with the bush tax cut that to what you say what we ought to republican disagree complaining that the dole tax cut short and that spends too much on thing that i walk rich off of the disagree completely with the bill for a short and bit too much on things they say will create jobs republican disagree completely with the dole tax cut short and that spends too much on things they say will create jobs republicans disagree complaining that the bill's tax cuts fall short and that it spends too much on things they say won't create jobs
46. if you are going for 20 percent more accurate about what you filled with both water flowing through more than 20 percent watcher scooped up about two weeks we will refute filled with both product lines fell more than 20 percent last year and the people bought the week leading up to the holy filled with both product lines fell more than 20% last year and the people bought the week leading up to the holy sales of both product lines fell more than 20 percent last year as fewer people bought toys leading up to the holiday season
47. i think you have a show about one thousand the toymaker victor will start in november it would shut about it toymaker victor walter of november in which about one thousand job toymaker victor will start in november it would shut about it the toy maker based here in los angeles said in november it would shed about one thousand jobs
48. live from the future of economic has everyone been somewhat deteriorating economic if it wasn't immune from western european economic if it wasn't immune from western european economic and that it wasn't immune from last year's deteriorating economic environment
49. split effective at sparing the attack that question sparing the fact that you at sparing the attack that question spend or cut taxes that is the question
50. it's one thing to be way out of just about every to the at thing to a finger berating barometric about every starter at co think the thing to be raining down at the airport sector and for the growth think the thing to be raining down at the airport sector and for the growth pink slips seem to be raining down in just about every sector and every zip code
51. okay but i thought are there any bright spots out are there any bright spot for the are there any bright spot for the but are there any bright spots out there
52. but the job market may provide an the temporary job market lake water the temporary job market may provide an effort the temporary job market may provide an effort the temporary job market may provide an answer
53. it's becoming increasingly difficult heat become increasingly pitiful each become increasingly difficult each become increasingly difficult it's become increasingly physical
54. authorities have been slow to address it the authority has been slow to attract the authority to go to the authority has been slow to attract the authorities have been slow to address it
55. an egyptian cleric convicted a man of such the an egyptian court convicted a man of such an egyptian court convicted a man of schoharie an egyptian court convicted a man of such the an egyptian court convicted a man of sexual harassment
56. women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful women's rights advocates are cautiously hopeful now

Appendix D: Internal Review Board Approval