POST-SPEECH-RECOGNITION PROCESSING IN DOMAIN-SPECIFIC 
TEXT-CORPUS-BASED DISTRIBUTED LISTENING SYSTEM: 
ANALYSIS, INTERPRETATION AND SELECTION 
OF SPEECH RECOGNITION RESULTS 
 
 
Except where reference is made to the work of others, the work described in this thesis 
is my own or was done in collaboration with my advisory committee. 
This thesis does not include proprietary or classified information. 
 
 
 
__________________ 
Spencer Jaehoon Lee 
 
 
Certificate of Approval: 
 
 
 
_______________________________  ______________________________ 
Cheryl D. Seals     Juan E. Gilbert, Chair 
Assistant Professor     Associate Professor 
Computer Science and Software   Computer Science and Software 
Engineering      Engineering 
 
 
 
_______________________________  ______________________________ 
Gerry V. Dozier     Stephen L. McFarland 
Associate Professor     Dean 
Computer Science and Software   Graduate School 
Engineering
POST-SPEECH-RECOGNITION PROCESSING IN DOMAIN-SPECIFIC 
TEXT-CORPUS-BASED DISTRIBUTED LISTENING SYSTEM: 
ANALYSIS, INTERPRETATION AND SELECTION 
OF SPEECH RECOGNITION RESULTS 
 
 
Spencer Jaehoon Lee 
 
 
A Thesis 
Submitted to 
the Graduate Faculty of 
Auburn University 
in Partial Fulfillment of the 
Requirements for the 
Degree of 
Master of Science 
 
 
Auburn, Alabama 
December 15, 2006 
 
iv 
POST-SPEECH-RECOGNITION PROCESSING IN DOMAIN-SPECIFIC 
TEXT-CORPUS-BASED DISTRIBUTED LISTENING SYSTEM: 
ANALYSIS, INTERPRETATION AND SELECTION 
OF SPEECH RECOGNITION RESULTS 
 
Spencer Jaehoon Lee 
 
Permission is granted to Auburn University to make copies of this thesis at its discretion, 
upon the request of individuals or institutions and at their expense. 
The author reserves all publication rights. 
 
 
 
_______________________________ 
Signature of Author 
 
 
 
_______________________________ 
Date 
 
v 
THESIS ABSTRACT 
 
POST-SPEECH-RECOGNITION PROCESSING IN DOMAIN-SPECIFIC 
TEXT-CORPUS-BASED DISTRIBUTED LISTENING SYSTEM: 
ANALYSIS, INTERPRETATION AND SELECTION 
OF SPEECH RECOGNITION RESULTS 
 
Spencer Jaehoon Lee 
 
Master of Science, December 15, 2006 
(B.E., Konkuk University, August 2003) 
 
68 Typed Pages 
Directed by Juan E. Gilbert 
 
Achieving usable recognition rates has been an almost never-ending quest in speech 
recognition research for more than three decades. Recently speech recognition rates have 
dramatically improved in conjunction with the rapid development of computer 
technology, but it has never been enough to satisfy human expectation. Many researchers 
tried to testify the benefit of using multiple speech recognizers in improving recognition 
rates. The fundamental idea supporting this research trend is that recognition results 
agreed upon by a majority of recognizers can be considered correct. This paper tries to 
vi 
break the old idea which may prevent multi-recognizer researches forever from achieving 
usable recognition rates, by revealing the existence of common misrecognition (CMR) 
results agreed upon by the majority. The common misrecognition results are classified 
into several categories (contraction, missed words, spoken stop words, homophone, and 
combined misrecognition) and treated according to their characteristics. A collection of 
sentences users may speak (simple text-corpus) is used in order to overcome very low 
sentence recognition rates of speech systems. It is suggested that composite information 
made out of multiple recognition results is enough to correctly find its actual target 
sentence among thousands of sentences in a specific domain. Overall the results (87% of 
sentence recognition rate) of experiments conducted in this research strongly support that 
the processes described in this paper can greatly improve speech recognition rates of 
multi-recognizer systems. 
vii 
Style manual or journal used: Journal of SAMPE
 
Computer software used: Microsoft Word 2003 
viii 
TABLE OF CONTENTS 
LIST OF FIGURES ............................................................................................................ x 
LIST OF TABLES............................................................................................................. xi 
CHAPTER 1 INTRODUCTION ........................................................................................ 1 
CHAPTER 2 LITERATURE REVIEW ............................................................................. 5 
CHAPTER 3 INITIAL STUDY ....................................................................................... 16 
3.1 Analysis of preliminary experiment data of Distributed Listening (DL) ............... 16 
3.2 Input Analysis......................................................................................................... 18 
3.2.1 Pattern matching .............................................................................................. 19 
3.2.2 The progress of pattern matching .................................................................... 23 
3.3 Domain-specific Simple text-corpus database........................................................ 24 
3.4 Normalization ......................................................................................................... 25 
3.5 Experiment 1........................................................................................................... 27 
3.5.1 The result of experiment 1 ............................................................................... 28 
CHAPTER 4 IMPLEMENTATION AND EXPERIMENTS .......................................... 30 
4.1 Common misrecognition (CMR)............................................................................ 30 
4.1.1 Contractions (CT) ............................................................................................ 30 
4.1.1.2 Contractions in English............................................................................. 30 
4.1.1.3 Contractions in CMR ................................................................................ 32 
4.1.1.4 Contraction or expansion? ........................................................................ 32 
4.1.2 Misrecognized words (MRW) ......................................................................... 33 
4.1.3 Missed words (MW) ........................................................................................ 33 
4.1.4 Stop words ....................................................................................................... 34 
4.1.4.1 Stop words treatment ................................................................................ 36 
4.1.4.2 Spoken stop words .................................................................................... 36 
ix 
4.1.5 Homophone (HMP) ......................................................................................... 38 
4.1.5.1 HMP treatment.......................................................................................... 39 
4.1.6 Combined misrecognition (CBMR)................................................................. 39 
4.1.6.1 CBMR treatment....................................................................................... 40 
4.2 Matching process .................................................................................................... 41 
4.3 Experiment 2........................................................................................................... 43 
4.3.1 The result of experiment 2 ............................................................................... 45 
4.4 Experiment 3 (Overall design of interpreter).......................................................... 46 
4.4.1 The result of experiment 3 ............................................................................... 48 
CHAPTER 5 CONCLUSIONS ........................................................................................ 52 
BIBLIOGRAPHY............................................................................................................. 55 
x 
LIST OF FIGURES 
 
Figure 1: The progress of the development of speech recognition..................................... 3 
Figure 2: Hardware configuration used in the experiment of [Barry et. al. 1994] ............. 5 
Figure 3: ROVER system architecture ............................................................................... 7 
Figure 4: The progress of forming WTN............................................................................ 7 
Figure 5: An example of a composite WTN....................................................................... 8 
Figure 6: The result of voting process ................................................................................ 8 
Figure 7: Word error rates in function of the number of combined system ....................... 9 
Figure 8: Parallel recognition units................................................................................... 12 
Figure 9: Input Analysis.................................................................................................... 19 
Figure 10: 16-pack pattern for two inputs......................................................................... 20 
Figure 11: 64 patterns for input analysis........................................................................... 21 
Figure 12: The progress of pattern matching.................................................................... 22 
Figure 13: Experiment 1 setup.......................................................................................... 28 
Figure 14: Matching process ???????????????????????.42 
Figure 15:  Experiment 2 setup ?????????????????..?...???44 
Figure 16: Experiment 3 setup ??????????????????..????49 
Figure 17: Matching process ??????????????????...?.?.??50 
Figure 18: Overall interpretation progress ???????????.??...????51 
xi 
LIST OF TABLES 
 
Table 1: Relative improvement rate with respect to the best single recognizer (17.1%) . 10 
Table 2: Word recognition rates of DL preliminary experiment ...................................... 13 
Table 3: Analysis results of preliminary experimental data ............................................. 17 
Table 4: Common misrecognition collection (CMR) ....................................................... 31 
Table 5: Stop words lists................................................................................................... 37 
Table 6: Contribution rates of each CMR treatment......................................................... 45 
1 
CHAPTER 1 
INTRODUCTION 
 
Speech is a very basic, but the most natural and effective communication method that 
humans possess. Though there are many other effective means of human communication 
such as writing, gesture, and facial expressions, communication through speech is 
preferred especially when people exchange information and establish relationships with 
each other. [Pinker 1994] 
 
As computer technology rapidly grows, development of communication methods 
between human and computers has also grown greatly. Especially graphical user 
interfaces (GUI) using traditional input-output devices such as screen display, keyboard, 
mouse, and joystick are in the very mature stage of usability and their further 
developments are focusing on increasing fidelity in order to meet the highly increasing 
expectation of human users and ergonomic usability to reduce repetitive strain injury 
(RSI). 
 
As an alternative means of human-computer interaction, speech recognition technology 
has been developed with a great deal of effort from researchers for more than 30 years. In 
spite of the very rapid progress of recent speech technology development prompted by 
2 
advances in computer power, algorithm and memory capacity, there still exists a big 
usability gap between human expectation and accuracy of speech recognition. [Deng and 
Huang 2004] 
 
The traditional GUI is not always optimal for human-computer communication. In 
limited situations with small devices such as cell phones and PDAs, or driving a car, the 
use of a GUI is very limited or almost impossible. Even in normal working environments, 
it is much quicker to dictate text to a computer than to type it by hand [Larsen et al.1992]. 
In contrast, speech interfaces can provide consistent usability across all computer devices. 
[Deng and Huang 2004] 
 
In highly developed countries like the United States, the rapid increase of the aged 
population (over 65 years of age) is a very common. The percentage of this population in 
the United States in 2000 was 12.6%, and it is estimated that the percentage will rise up 
to 20% by 2030 [Gavrilov and Heuveline 2003]. Aged people are not very familiar with 
interacting with computers and gradually lose their mobility and dexterity more and more 
as they age. Therefore, the development of an alternative means of human-computer 
interaction to the traditional GUI is very encouraged and emphasized, because the easy 
use of computer technology in conjunction with robotics technology may make their 
elderly life much better or even allow them to work beyond traditionally expected 
retirement ages [Stevenson and McQuivey 2003]. The best alternative communication 
means to the traditional GUI can be speech interactions, because most people retain their 
speaking ability throughout life. 
Figure 1 from [Deng and Huang 2004] shows that word recognition error rate under 
controlled environment has gradually decreased by an average of about 10% every year 
during the past two decades. After the recognition rate reached beyond 90%, two new big 
research trends came out, which are speech recognition under normal acoustic (noisy) 
environment and conversational (casual) speech recognition. [Deng and Huang 2004]  
 
 
Figure 1: The progress of the development of speech recognition 
 
 
Another research trend for achieving usable speech recognition is the use of multiple 
speech recognition engines to get a better overall speech recognition result. A sentence is 
a collection of words. Likewise speech recognition of a sentence is a result of continuous 
speech recognition of single words. Therefore successful recognition of a sentence 
depends on successful recognition of every single word in the sentence. Most speech 
recognition rates in speech recognition literatures refer to word recognition rates not 
3 
4 
sentence recognition rates. Even though word recognition rate reached beyond 90%, 
sentence recognition rate is still very poor.  
The purpose of using multiple speech recognizers is to make a best (preferably a 
complete) composite output out of several incomplete recognition results by 
complementing each other?s result. More specifically most multi-recognizer research 
efforts focus on how to select the right words in certain regions of recognition results 
where a majority of recognizers don?t clearly agree. 
 
This research is an extension of distributed listening [Gilbert 2005], which consists of 
multiple recognizers with their own individual microphones (called the ?listener?) and 
one system (called the ?interpreter?) that collects recognition results of listeners and tries 
to deliver a correct result. This paper will talk about the results of the preliminary 
experiment data analysis of distributed listening, suggest the interpreter design based on 
the analysis, and implement and test the suggested interpreter design. The data analysis 
revealed that it is very possible that a majority of recognition results returned by listeners 
contain a large number of the same incorrect words (common misrecognition) which 
make the interpreter unable to return positive results. The suggested interpretation design 
includes a normalization process to clean and reformat strings, a pattern matching process 
to examine and separate common information and uncommon information found in the 
multiple recognition results, several common misrecognition (CMR) treatments to 
resolve the common misrecognition problems found in the analysis part, and a selection 
process to choose a correct result when multiple matching results of sentence recognition 
were returned. 
CHAPTER 2 
LITERATURE REVIEW 
 
[Barry et al. 1994] is one of the early researches using multiple speech recognizers. As 
shown in figure 2 one microphone input was shared by three computers (recognizers) and 
one master computer (RS-232) receives recognition results from each of the three 
recognizers, and then selects one with clear majority. The experiment was single-word 
speech recognition test and two sets of 20 and 25 words were used. 
 
 
Figure 2: Hardware configuration used in the experiment of [Barry et. al. 1994] 
 
The selection algorithm was very simple. When there is a word with clear majority, the 
word is selected as a correct recognition result. For example, if the master receives two 
identical words and one other, the common words is selected. If it receives one word and 
5 
two invalid or no response, the word is selected. If there is no word with clear majority 
such as three different words, or no responses, the master received the second best words 
from the three recognizers, and then the new words participated in the competition with 
equal weight. Figure 3 is one of the experiment results and it clearly shows that the 
combined overall result (EMR) is better than any of the three individual recognizers. 
 
 
Figure 3: The result of thee recognizer system 
 
As stated in the paper, the performance of ITT is much superior to the two other 
recognizers (Votan and TI) and the overall result (97.3%) is just slightly higher than the 
result of ITT (96.7%). So there is some possibility that the outstanding overall result is 
due to the superior performance of ITT. [Barry et al. 1994]  
 
This early research testified that the use of multiple recognizers can improve the overall 
word recognition rate. The method of choosing a word with clear majority became the 
very fundamental idea. The algorithm to select a right word was frequently modified and 
6 
improved upon to yield better results in the word level selection processes in many of the 
later research efforts using multiple speech recognizers. 
 
Recognizer Output Voting Error Reduction (ROVER) 
Recognizer Output Voting Error Reduction (ROVER) system [Fiscus 1997] used an 
alignment module to align multiple recognition results and then a voting module to form 
a composite output of words with higher voting scores as seen in figure 3.  
 
Figure 3: ROVER system architecture 
 
 
Figure 4: The progress of forming WTN 
 
The alignment module aligns multiple speech recognition results into a single composite 
output called word transition network (WTN) by using dynamic programming (DP) 
method. Figure 4 shows the alignment progress. The first input (WTN-1) is chosen as a 
base WTN and then the second WTN is aligned with the base. The third picture of figure 
7 
4 shows a new base WTN. And this progress is repeated until there is no more speech 
recognition input to combine. As stated in the paper, the final form of the composite 
WTN is not always an optimal WTN because the combining order of WTNs affects the 
composite WTN and may result in different composite WTNs. [Fiscus 1997] 
 
 
Figure 5: An example of a composite WTN 
 
One error!
The result of ROVER system:
The correct target:
 
Figure 6: The result of voting process 
 
Figure 5 is an example of a composite WTN formed by aligning five outputs. Each 
column is an independent voting section and the word with highest score in each section 
is selected as a correct word. The final result produced from the composite WTN by 
ROVER system (Figure 6) shows that it has one error, which is much better than the least 
number of errors, 3, the best single system has.[Fiscus 1997] In the experiment using 
submissions of LVCSR 1997 Hub 5-E evaluation (administered by NIST), ROVER 
system reduced the word error rate from 44.9% (the rate of the best single system) to 
39.4% [Fiscus 1997]. In the experiment using nine systems conducted by NIST, ROVER 
was able to reduce the word error rate from 13.5% to 10.6% [Pallett et al. 1999]. 
8 
[Schewenk and Gauvain, Sept 2000] improved ROVER system by adding 
normalization/filtering process and using language model information. One of the 
problems the original ROVER system had was that the order of combining outputs in the 
alignment module affected the result of the aligning process. The paper suggested that it 
is better to use the output from the best recognition system first and then combine the rest 
in the descending order of recognition rate. Figure 7 shows that the error rate did not 
change much and sometimes even increased when 5 to 9 systems were combined. 
Therefore it was suggested that combining systems with high error rate may not improve 
the performance. [Schewenk and Gauvain, Sept 2000]  
 
Original ROVER
Improved ROVER with
arbitrary tie break
Improved ROVER with
tie break using Language Model
 
Figure 7: Word error rates in function of the number of combined system 
 
9 
The normalization/filtering process is a process to change phrases with alternative 
spellings to one common form. For example, ?cannot?  ? ?can not?, ?child?s?  ? 
?child?s? or ?child is? or ?child has?. In the improved ROVER system, only simple one-
to-one filtering was used because of technical difficulty, and a slightly improved result 
(10.1% to 10.0%) was produced when outputs of 7 recognizers were combined. Another 
problem of the original ROVER system is that some words had the same voting scores 
(called ?ties?) after alignment and the ties were arbitrarily broken in the system. If the 
correct words are ideally selected every time a tie is encountered instead of breaking 
them arbitrarily, the word error rate may drop below 5%. So in the improved ROVER 
system, language model (LM) information was used as a tie breaker in order to select 
correct words in tie situations. Figure 7 shows that tie breaking using LM produced better 
results when combining up to the 4
th 
recognizer, but almost same or even worse results 
when combing 5 to 9 recognizers. Table 1 from [Schewenk and Gauvain, Oct 2000] 
shows the relative improvement rate to the rate of the best recognizer achieved the 
highest (20.5%) when three best recognizers were combined. These result suggests that 
the use of LM for breaking ties can improve the word error rates but the number of 
recognizers combined may need to be restricted and recognizers with high error rates 
should be excluded. [Schewenk and Gauvain, Sept 2000] 
 
Table 1: Relative improvement rate with respect to the best single recognizer (17.1%) 
 
10 
11 
ROVER is a very popular speech recognition system using multiple recognizers, which 
was frequently referenced in many other related research efforts to compare with their 
research results. ROVER introduced and tried many useful processes to improve 
recognition results, such as alignment, voting, normalization/filtering, and use of 
language model for tie break, and successfully proved that most processes can lower the 
error rates. It also proved that combining a restricted number (3 or 4) of the best 
recognition results is better than combining all recognition results including unreliable 
ones with high error rates. 
 
[Cristoforetti et al 2003] used multiple speech recognizers to achieve robust speech 
interaction in vehicles. One input source (one microphone) was shared by recognizers as 
shown in figure 8 and each of the recognizers carried different sets of vocabularies and 
language model for specific narrow domains such as tourist information, geographic 
information, etc, instead of carrying one large vocabulary and language model replicated 
over each recognizer in order to reduce complexity. In contrast with ROVER system, the 
system selects one best recognition result with the highest possibility, instead of making a 
composite result out of all the recognition results returned by the recognizers. Another 
characteristic of this research is that a corpus of speech interactions which can happen in 
a vehicle was gathered in simulated driving situation applying the Wizard of Oz method 
(WOZ), and it was used to populate the sets of vocabularies and language models of each 
recognition unit. [Cristoforetti et al 2003] The result of the experiment in this research 
shows that word recognition rates were improved with respect to the rate of recognizers 
carrying information for a whole domain by 3.7% (58.5%  ? 62.2%) for closely located 
microphone inputs, and 2.3% (45.7%  ? 48.1%) for far located microphone inputs. 
Sentence recognition rates were improved by 5.6% (29.4%  ? 35.0%) and 6.9% (29.2% 
 ? 22.3%) respectively. Therefore this research revealed that there is a possibility that 
distributing vocabularies and language model among recognizers instead of making each 
one carry all of the information may greatly improve overall sentence recognition rates. 
 
 
Figure 8: Parallel recognition units 
 
Distributed Listening (DL) 
[Gilbert 2005] introduced a detective story as a good analogue of multiple speech 
recognizer systems to develop a better architecture design in conjunction with their 
similarity. A crime is committed and it is witnessed by several people. A detective starts 
the crime investigation by interviewing the witnesses. Each of the witnesses gives the 
detective slightly different stories because they were in different places and perceived the 
event differently. The detective tries to develop one complete scenario based on the idea 
that common pieces of witnesses are more likely to be right. And incomplete parts are 
resolved by the experience and knowledge of the detective. This story can be restated in 
terms of speech recognition as described next. 
 
12 
An utterance is spoken and it is recognized by several recognizers. A process starts the 
speech examination by gathering information from the recognizers. Each of them gives 
slightly different results because they were in different places and recognized the speech 
differently. The process tries to develop one composite result based on the idea that 
common pieces in each recognition result are more likely to be correct. And incomplete 
parts are resolved by the intelligence and information the process has. 
 
As an analogy of independent witnesses in crime scene, multiple speech recognizers with 
their own microphones were used and named ?Listeners?. As a detective, a system 
examining recognition results returned by each listener is named ?Interpreter?. It was 
suggested that it would be more beneficial for each listener to have its own input source 
because bad input will be given to all the listeners with only one shared microphone 
when a good signal isn?t received. 
 
 
Table 2: Word recognition rates of DL preliminary experiment 
13 
14 
Table 2 shows the word recognition rates of the preliminary experiment with two 
listeners of distributed listening. 46 subjects (16 females, 30 males) participated in the 
experiment and one of them was excluded in the counting because it is an outlier. The 
tasks given to each subject was to dictate 3 2-sentence paragraphs taken from a book. 
Totally 135 pairs (45 x 3) of speech recognition results returned by two listeners were 
collected during the experiment. The combined result is the percentage of correct words 
when the recognition results of listener 1 and listener 2 are ideally combined. The 
average improvement with respect to the average rate of the best recognizer (listener 2) is 
9.0% (79.51%  ? 88.55%). This improvement rate suggests that distributed listening 
systems may collect all the correct words in a sentence when the results of 4 or slightly 
more listeners are combined. And it may also closely support the number of recognizers 
suggested by [Schewenk and Gauvain, Oct 2000]. 
 
Many topics regarding multiple speech recognizer research were briefly described here. 
Most researchers tried to answer and resolve the following questions: Should we use one 
single input (microphone) or individual input source for each recognizer? Are 
normalization/filtering processes useful? Should we use same recognizers or different 
recognizers? Choose a best recognition result or make a composite one? How many 
recognizers need to be used? Is there any aligning method which is unaffected by 
combining order? Is a word (or phrase) agreed by a majority of recognizers always a 
correct one? Can a collection of sentences (corpus) users may speak be useful? Some of 
them were well proved but some are still in question. 
 
15 
The ultimate objective of speech recognition research using multiple recognizers is to 
increase sentence recognition rates up to usable rates. Many researchers showed that 
there were increases in word recognition rates but sentence recognition rates are still very 
low though improved by combining multiple recognition results. The very fundamental 
idea supporting multi-recognizer research is that information agreed by a majority of 
recognizers is more likely to be correct. However, it may not always be true because 
there are so many similar words in English vocabularies and their pronunciations are very 
easily affected by surrounding noises and intonation of speakers. Therefore there is very 
high motivation to clarify the possibility of the existence of common misrecognitions, 
improve how to utilize the advantages of using multiple recognizers or find other ways to 
make usable multi-recognizer systems even when sentence recognition rates are very low.  
16 
CHAPTER 3 
INITIAL STUDY 
 
3.1 Analysis of preliminary experiment data of Distributed Listening (DL) 
Table 3 shows the analysis results of the preliminary experiment data of Distributed 
Listening. Garbage words in table 3a are misrecognized words found in each result of 
listener 1 and 2, but not in both of them. Common misrecognition refers to misrecognized 
words commonly found in both results of listener 1 and 2. Missed words are commonly 
unrecognized words by both listeners. The average number of misrecognized words by 
both listeners is 6.9. This means that 2-listener distributed listening systems may have 
about 7 misrecognized words in every speech-recognition. More importantly it shows 
another finding that the number of commonly misrecognized words may increases by 3 
when one listener is added to the system. 3 words are about 10% of the average number 
of words (29.7 words) the three paragraphs contain. [Schewenk and Gauvain, Oct 2000] 
suggested that combining 3 recognizers may give the best benefit of using multiple 
recognizers, and the analysis of the preliminary experiment of distributed listening 
showed that all the necessary correct words may be gathered when 4 or slightly more 
listeners are combined. 
 
 
Table 3a
Table 3b 
 
 
Table 3: Analysis results of preliminary experimental data 
 
Another very important finding is almost 63% (table 3b) of recognition pairs returned by 
two listeners contain at least one commonly misrecognized word in both recognition 
results. And the average number of common misrecognitions per speech recognition is 
1.1 (0.8 + 0.3 (common misrecognition + missed words)). This means that there is a very 
high possibility that every speech recognition result returned by 2-listener distributed 
listening systems may averagely contain at least one commonly misrecognized word. 
This finding strongly supports the theory that recognition results agreed by a majority of 
recognizers in multi-recognizer system may not be always correct. Therefore if this 
17 
18 
important fact is not seriously taken into account in designing multi-recognizer systems, 
the sentence recognition rates may never be able to meet the usable level. 
 
3.2 Input Analysis 
Input analysis in this research is the process which inspects recognition results received 
from each listener and then returns two important outputs used in later processes. One is 
Common Words and Structure (CWS), and the other is Different Words list (DW). CWS 
contains sets of common words found in the recognitions results and markers (?***? + 
index of each set) for uncommon words sets found between common words sets. DW 
contains the sets of the uncommon words and their index is same as the postfix index of 
place markers in CWS which indicate where they were collected. All the sets and 
markers in CWS and DW are in the order found. Figure 9 illustrates the actual structure 
of CWS and DW. Simply, CWS contains common words and markers in the order found, 
and the DW list contains uncommon words sets found in the region the markers of CWS 
indicate. In other words, CWS carries common information and DW carries uncommon 
information. 
 
Common
Words and 
Structure (CWS)
***0  ? using your brain  ? ***1  ? free soon your  ? ***2  ? could ever pay  ? ***3  ? see things  ? ***4
one keep far work
mind will show way
you'll never see other
beyond knee ways making
far
***0 ***1
***2
***3
***4
Sets of
Different
Words (DW)
using your brain
19 
far mind will show way beyond knee you'll never see
one keep work ways making far other
free soon your could ever pay see things
using your brain free soon your could ever pay see things
2
nd
input
1
st
input
Input
Analysis
+
Common
Words
Region
(CWR)
CWRCWRCWR
+
 
Figure 9: Input Analysis 
 
Later in this research, CWS is used for finding a target paragraph among all the candidate 
paragraphs. If a candidate paragraph contains all the common words sets in CWS in order, 
then the candidate paragraph is considered as a matching target paragraph. It is possible 
that the matching process returns multiple target paragraphs, so a process selecting a best 
target paragraph is necessary. Uncommon words sets of DW are used for counting the 
number of matching words in uncommon words regions in candidate paragraphs and 
calculating the final matching rate of the paragraph. 
 
3.2.1 Pattern matching 
During the input analysis process, a pattern matching algorithm was used in order to 
detect and collect common words in the two recognition results returned by two listeners 
and separately collect uncommon words between common words regions. The use of the 
pattern matching approach provides ease of changing and using various patterns, 
flexibility of accommodating inputs from more than two listeners and much better 
readability of program codes.  
The length of uncommon words region (6)
The number of uncommon words in the pattern = 10
(the number of red packs)
The length of the pattern (8)
Output from listener 1
Output from listener 2
Common Words
 
Figure 10: 16-pack pattern for two inputs 
 
2-input 16-pack patterns (figure10) which can accommodate 8 words from each of the 
recognition results were used for detecting common words in the pairs of recognition 
results. Each row receives words from each recognition results. The length of the pattern 
(8) was chosen based on the analysis result that the longest segment of uncommon words 
in all the recognition results of the preliminary experiment data is 7. 
20 
 
Figure 11: 64 patterns for input analysis 
 
Figure 11 shows 64 patterns which are all the possible combinations that a pair of 
common words can be located in two 8-word phrases. The patterns are grouped by and 
then arranged in the order of the number of uncommon words in each pattern. And within 
each group, patterns are arranged in the ascending order of the length of uncommon 
words region. Therefore patterns with smaller number of uncommon words and shorter 
length of uncommon words region are used first in the pattern matching process. 
 
 
21 
 
 
 
Figure 12: The progress of pattern matching 
 
 
 
22 
23 
3.2.2 The progress of pattern matching 
(1) Begin the pattern matching process with the first 8 words of each of recognition 
results (or next 8 words if comes from step 3) by comparing patterns one by one. 
(*) If patterns run out without finding any match, then the pattern matching process 
terminates without returning any results. This means that the degree of misrecognition of 
either or both of the two inputs is severe. 
(2) If a pattern is matched (a pair of common words found),  
(2A) All the uncommon words (red packs in patterns, if any) are added Different 
Words (DW) list as a set. 
(2B) Place a marker for uncommon word region (if uncommon words found) in 
Common Words & Structure (CWS). (*** + index of the set added to DW list) 
(2C) keep checking the next pairs of word by increasing indices of both inputs 
together one by one until an uncommon pair is found. 
(2D) if an uncommon pair is found, all the common words found until then form a 
piece of CWS and saved, and then go to step (3). If reached at the end of both 
inputs without an uncommon pair found, then all common words found until then 
form the last piece of CWS and saved, and the whole pattern matching process 
terminates. 
(3) Go to step (1) and begin pattern matching process from the first pattern. 
(4) If the pattern matching process reaches the end of both of the inputs, the pattern 
matching process ends. If there is (are) uncommon word(s) found until then, then do step 
(2A) and (2B). Finally the pattern matching process returns Common Words & Structure 
(CWS) and Different Words (DW) list. 
24 
3.3 Domain-specific Simple text-corpus database 
A corpus is a big collected set of written or spoken texts normally used for linguistic 
research. A corpus is usually electronically processed and saved in order to make it 
more useful and add computerized efficiency for research.[Wikipedia 2006] The corpus 
used in this research is the collection of sentences or paragraphs speakers (users of 
distributed systems) may utter. The tasks given to the users during the preliminary 
experiment were to dictate 3 2-sentence paragraphs randomly collected from chapter 2 
of a book [Kiyosaki 2000]. Therefore the domain of the corpus will be the book and the 
corpus will consists of all the sentences (paragraphs) found in chapter 1 and 2 of the 
book. 
 
Target paragraphs in this research are the 3 2-sentence paragraphs the speakers actually 
spoke during the preliminary experiment. Candidate paragraphs are all other sentences 
(or paragraphs) in the corpus. In the early experiments in this paper, it will be tested to 
see how many common words & structures (CWS) can match their corresponding target 
paragraphs. The matching condition is true if a 2-sentence paragraph contains all pieces 
of a CWS in order. And then in later experiments, it will be tested how many of CWSs 
are able to correctly find their target sentences in the corpus (the targets are mixed with 
all other candidate paragraphs). If multiple target sentences are found for one CWS, 
then a selection process will compute a matching rate to determine the best matching 
target. 
 
25 
The corpus is a domain-specific simple text-corpus because the texts are collected from 
a specific domain which is a collection of written (not spoken) texts from a book, and 
electronically saved without any further processing. 
 
The reason why simple text-corpus database is used in distributed listening system is 
because sentence recognition rate is still too low even after combining results of 
multiple recognizers. Only one listener (0.7%) was able to fully recognize a paragraph 
correctly during the preliminary experiment and only 7 combined results (5.2%) contain 
all the words in their target paragraphs. Moreover, these recognition rates are possible 
only when it is assumed that the contracted and expanded phrase are considered same 
(for example, ?you?ll? = ?you will?) and the recognition results are ideally combined. 
Therefore the sentence recognition rate is almost zero. 
 
Therefore in this research, it is suggested and tested that the combined recognition 
results are compared with sentences in the collection of possible sentences users may 
speak. And the best matching sentence is selected as a right recognition result (target). If 
there are multiple matching sentences, then a selection process will be used to calculate 
matching rates of the sentences to choose the best matching sentence. 
 
3.4 Normalization 
The analysis of 135 pairs of speech recognition results collected from the preliminary 
experiment of Distributed Listening [Gilbert 2005] shows that there are many redundant 
white spaces, unnecessary special characters, and case sensitivity problems in the text of 
26 
the speech recognition data. The existence of redundant spaces between words and same 
words with different cases (lower or upper), for example, ?I?ll? and ?i?ll? might result 
from minor mistakes in design of grammar or improper setting of return text values for 
the corresponding recognized words. And unexpected special characters like a period or a 
single quotation mark in the place of an apostrophe might result from minor mistakes 
while moving recognition data for analysis. 
 
These mistakes can be considered very minor because the numbers are relatively small. 
But some kind of cleaning process is still required before the recognition results are sent 
to post-speech-recognition process, because they will negatively affect the string analysis 
and matching processes. An automated process is recommended for future experiments 
when transferring data from one medium to another. Furthermore, this result of the 
observation led to the conclusion that a same process must be performed for candidate (or 
target) paragraphs in a text-corpus database (a collection of candidate paragraphs) before 
they are involved in a matching process, because string analysis and matching must be 
done between words with same format. The format of text paragraphs saved in the 
database is more likely to be human-friendly (written in normal writing format), or has a 
possibility that it is not well formatted. 
 
In this research, the cleaning and reformatting process is called ?normalization?. 
Normalization changes all characters to lower case, removes all the redundant white 
spaces between words, remove all special characters except an apostrophe in contraction 
forms, for example, ?they're? and replace a single quotation mark with an apostrophe. 
27 
These kinds of unexpected minor character problems will be very likely especially when 
the speech applications are multimodal systems which accept inputs from both speech 
and keyboard. For accurate and efficient string processing, a similar kind of 
normalization which makes text inputs well formed is very necessary for applications 
using speech recognition and natural language processing. 
 
3.5 Experiment 1 
In this experiment (illustrated in Figure 13), the pairs of speech recognition results 
returned by two listeners will be processed through normalization before sending them to 
the input analyzer (pattern matching) process. And then after input analysis, it is noted 
how many of the CWSs match their corresponding target paragraphs. Target paragraphs 
here mean the paragraphs the speakers actually spoke. If all the pieces of a CWS returned 
by the input analyzer are found in a candidate in order, the candidate is considered as the 
matching target. All of the 135 pairs of speech recognition results from the preliminary 
experiment will be used in the experiments. 
 
 
Figure 13: Experiment 1 setup 
 
3.5.1 The result of experiment 1 
After adding the normalization process into the post-speech-recognition (PSR) process, 
32 (27.8%) out of 135 CWSs were able to correctly detect their target paragraphs. The 
treatments included in the normalization seem to be very basic and trivial but they are 
very essential processes for correct and efficient string matching processes which will be 
performed later. When the normalization process is turned off, even though advanced 
CMR treatments were on which will be followed in this paper, the matching process 
28 
29 
simply returned 0% of matching rate. Therefore it can be said that the normalization 
process increased zero recognition rate up to 27.8%. 
30 
CHAPTER 4 
IMPLEMENTATION AND EXPERIMENTS 
 
4.1 Common misrecognition (CMR) 
Common misrecognition (CMR) is commonly misrecognized words by both (all) 
listeners. The CMRs found in the analysis of the preliminary experiment can be classified 
into 6 categories: Contractions (CT), general Misrecognized words (MRW), Missed 
words (MW), Stop words (SW), Homophones (HMP), and Combined misrecognitions 
(CBMR) as seen in Table 4. 
 
4.1.1 Contractions (CT) 
4.1.1.2 Contractions in English 
A contraction is a reduced form of usually two (rarely three) words. Mostly the first word 
is a pronoun and the second word is an auxiliary verb, or combination of auxiliary verb 
(or be verb) and negation (not). For example, "you'll" is the contraction form of "you 
will", and ?aren?t? is the contraction form of ?are not?. An apostrophe (') is inserted 
between the words of contraction forms. Contractions form a different word from their 
former words but they are same things. Contractions are more likely to be used in spoken 
language than in written language. [Wikipedia 2006] 
.
 
 
Table 4: Common misrecognition collection (CMR) 
31 
32 
4.1.1.3 Contractions in CMR 
26.9% of CMRs found in the whole CWS collection are related to contraction problem. 
?you will? was recognized as ?you'll? in 51 pairs of recognition results (26 from 
paragraph 1, 25 from paragraph 3), and ?that?s? was recognized as ?that is? in one 
recognition results (paragraph 2). This is the second largest category of CMR collection. 
The manner in which this problem is treated can contribute to significant increase of 
recognition results 
 
4.1.1.4 Contraction or expansion? 
It is difficult and sometimes ambiguous to expand contraction forms to their original 
forms (?you?ve?  ? ?you have?), because many contraction forms are shared by two 
different combinations of words. For example, the contraction form of ?I shall? and ?I 
will? is ?I?ll? and the contraction form of ?he has? and ?he is? is ?he?s?. To expand 
contractions which have two possible expended forms is not simple. The first step to 
expand them can be to check the following words and choose the right one according to 
predefined rules. For example, ?he?s going home? can be easily transformed to ?he is 
going home?. But ?you?ll have your money? can be either ?you shall have your money? 
or ?you will have your money?. The former is a promise and the later is a simple 
prediction. [The American Heritage 1996] To correctly expand a contraction form to 
either of two possible expanded forms, an advanced AI process which understands the 
context of situation or paragraph is required and sometimes it can be ambiguous when 
done even by humans. 
 
33 
Contrastively, it is much easier to contract words. By using a complete list of contraction 
forms and their expanded forms, two or three words can be simply replaced to their 
corresponding contraction form. Therefore, in this research the later method was applied 
and added as a later part of normalization process in order to increase the positive 
matching rate between CWSs and target paragraphs (or candidate paragraphs). All the 
normalized recognition results and normalized target or candidate paragraphs are 
processed through the contraction treatment which replaces all the word pairs found in 
the contractions list to the corresponding contraction forms. 
 
4.1.2 Misrecognized words (MRW) 
Misrecognized words are words commonly misrecognized by all listeners which do not 
belong to contractions, homophone, combined misrecognition, and missed words. Simply 
they are misrecognized words not belonging to any category. 25.6% of CMR collection is 
related to the general misrecognized words. 23 words are classified as general 
misrecognized words and many of them are commonly known stop words. 
 
4.1.3 Missed words (MW) 
Missed words are words commonly unrecognized by all listeners. So the listeners didn?t 
return anything for the speech utterances. 23.1% of CMR collection is related to the 
missed words. ?do?, ?and?, ?for?, ?in?, ?of?, ?that?, ?the?, and ?you? were commonly 
unrecognized (missed) by two listeners. The interesting thing is that all of them (98%) 
except one, ?do?, are commonly known stop words. 
 
34 
Missed words problem is handled by stop words treatment described in the following 
section. Missed words problem is resolved by removing stop words in target or candidate 
paragraphs. Because the words are missed and not present in recognition results, the 
removal of stop words in candidate paragraphs may increase the chances of positive 
matching results between CWSs and paragraphs. 
 
4.1.4 Stop words 
Stop words is a term usually used in the search engine area. It means insignificant words 
whose occurrence in a certain domain such as a database system, articles, etc is so 
frequent that they are excluded in search operations in order to save space and speed up 
the operations. For example, ?the?, ?is?, ?of?, ?and?, ?a? are very common stop words. 
[Wikipedia 2006] [Sullivan 2006] Every domain has its own stop words list because the 
context and words it has are different from others. 
 
The reason why stop words is introduced here is because the observation of CMR 
collection led to the inference that most of Missed words and Misrecognized words in 
CMR collection are commonly known stop words. The union of missed words and 
misrecognized words is the largest portion of CMR collection. The way in which these 
words are treated in conjunction with the similarity between the two different word 
collections (missed and misrecognized) and stop words may greatly improve the overall 
result of this research. 
 
35 
[Bray 2003] introduced how to statistically determine stop words of a certain domain. 
[Table 5b] is the stop words list brought from [Bray 2003]. 42.3% of missed words and 
misrecognized words are stop words in the list. But the frequency of the occurrence of the 
42.3% takes up 74.2% of the total frequency of missed words and misrecognized words. 
If a large portion of the words matches the stop words list from other domain, there is a 
high possibility that the percentages will go up when the stop words list from its own 
domain is used for the counting. Therefore a new stop words list was made using the 
statistical method presented in [Bray 2003]. 
 
The three target paragraphs used for the preliminary experiment of distributed listening 
[Gilbert, 2005] are three pairs of sentences brought from chapter 2 of ?Rich Dad, Poor 
Dad? [Kiyosaki 2000]. The book contains 57702 words comprising 4832 unique words. 
[Table 5a] is the stop words list of [Kiyosaki 2000] which was made using the simple 
statistical method described in [Bray 2003]. The 23 words in the list are just 0.5% of all 
the unique words but they take up about 33% of the total word occurrence of the book. 
42.3% of missed words and misrecognized words are found in the new stop words list 
and they take up 76.3% of the total frequency of missed words and misrecognized words. 
This is a smaller increase than expected. 
 
The stop words list (Table b) in [Bray 2003] is about three times larger in domain than 
[Kiyosaki 2000]. If a domain is much larger, its stop words list may contain more general 
and insignificant words which can be used in general. So in this research the union of the 
two stop words list was determined to be used for the experiment. It can be said that the 
36 
new mixed stop words list [Table 5c] is the union of a general stop words list and a 
domain-specific stop words list. The new list contains 35 words (8 from [Kiyosaki 2000] 
list, 12 from [Bray 2003] and 15 in common). 57.7% of missed words and misrecognized 
words are in the new mixed stop words list and they take up 84.5% of the total word 
occurrence of missed words and misrecognized words. Later in this research ?their? will 
be removed from the list because it is a combined misrecognition word (CBMR) which 
will be explained later, and ?far? is added because it was frequently misrecognized for 
?for? as an example of spoken stop words. 
 
4.1.4.1 Stop words treatment 
The treatment for the missed and misrecognized words determined to be used in this 
research is ?removal? of all the stop words as they were all ignored during search 
operations. All the stop words found in CWSs, DW lists and candidate paragraphs were 
removed before they are sent to matching process. 
 
4.1.4.2 Spoken stop words 
Stop words are not perfectly relevant to speech recognition research, but because of their 
strong similarity with common misrecognition words found in preliminary data collected 
from [Gilbert 2005], they were chosen to be used in this research as substitutes for 
common misrecognition words because there are no such data collections known to the 
public. 
 
 
Table 5b
Table 5a 
Table 5c 
 
Table 5: Stop words lists 
 
The characteristics of stop words are insignificance in search operations, very high 
frequency in given domains, mostly meaningless in contexts but words with grammatical 
roles. Additionally for speech recognition research, one more characteristic can be added, 
which is ?misrecognition-prone?. In English, words carrying very minor or no meaning 
but grammatical roles such as prepositions, articles, etc are usually spoken very short and 
37 
38 
unstressed to emphasize other important words such as nouns and verbs, etc in intonation. 
[Celik 2001] This factor may make speech recognizers easily miss or misrecognize such 
words. 
 
Consequently, a new term is introduced in this research, which is ?spoken stop words?. 
Spoken stop words can be defined as insignificant and misrecognition-prone words with 
very high frequency in a domain. Stop words make search operations stop, spoken stop 
words likewise make speech recognizers stop (misrecognize). 
 
It will be very helpful in increasing performance and accuracy of distributed listening 
systems to remove information like stop words which seem unnecessary or are incorrect. 
But if too much information is removed, there will be some trade-off. While the removal 
process increases the accuracy of earlier processes, it may damage the accuracy of later 
processes because the removal process also eliminates correct information. Therefore it is 
very important to collect correct spoken stop words lists for specific domains or 
recognizers and restrict the amount of information to be removed. 
 
4.1.5 Homophone (HMP) 
Homophones are words with the same pronunciation but different spellings and meanings. 
For example, ?write?, ?rite?, ?right?, and ?wright? are homophones. 7.8% of CMR 
collection is related to homophone problem. ?they?re? was recognized as ?their? which 
has same pronunciation as ?they?re? in 15 CWSs (14 from paragraph 2 and 1 from 
39 
paragraph 3). ?they?re? has same pronunciation as ?there? and ?their?, therefore it is very 
likely that speech recognizers recognize ?they?re? as one of the two words. 
 
4.1.5.1 HMP treatment 
In this research, when a piece of CWS doesn?t match with a region of a candidate 
paragraph during the matching process, a homophone treatment process is initiated to 
check the presence of homophones in the unmatched region. If the region has one or 
more homophones, the homophone treatment process brings the corresponding 
homophone set(s) from the homophone list, and then generates alternative phrases by 
replacing the word in the piece of CWS with its homophone siblings. The alternative 
phrases are used to match the previously unmatched region of a candidate paragraph. If 
no homophone is found or none of the alternative phrases matches the region, then 
combined misrecognition (CBMR) treatment process is initiated. A homophone list 
[Cooper 2001] consisting of 706 homophone sets (1529 homophones) were used to 
populate the HMP database. 
 
4.1.6 Combined misrecognition (CBMR) 
Combined misrecognitions (CBMR) are single words recognized for a phrase consisting 
of two words. 17.6% of CMR collection is related to CBMR problem. ?i?ll? and ?soon 
your? were misrecognized as ?all? and ?sooner? respectively in 18 CWSs. And ?of there? 
and ?for the? were misrecognized as ?other? in 16 CWSs. An observed characteristic of 
CBMRs is that the pronunciations of partial parts of CBMRs are similar to the 
pronunciations of partial parts of the corresponding target phrase. For example, ?sooner? 
40 
(CBMR of ?soon your?) has a full pronunciation of ?soon? and the last part (?r?) of the 
pronunciation of ?your?. 
 
4.1.6.1 CBMR treatment 
In the similar manner with the homophone (HMP) treatment, when a piece of CWS 
doesn?t match with a region of a candidate paragraph even after HMP treatment during 
matching process, a CBMR treatment process is initiated to check the presence of CBMR 
in the unmatched region. If a possible CBMR is present in the region, the CBMR 
treatment process fetches the corresponding CBMR set(s) from CBMR list, and then 
generates alternative phrases by replacing the CBMR in the piece of CWS with its 
CBMR sibling(s). And then the alternative phrases involve in matching process. If none 
of the alternative phrases matches the region during CBMR process, another process is 
invoked which makes all the possible combinations of homophones and CBMRs gathered 
in the previous processes and then produces another list of alternative phrases by 
replacing homophone or CBMR words with the corresponding siblings in each 
combination. 
 
If the region didn?t match during the previous three processes or contains neither a 
homophone nor a CBMR, the region of a candidate is considered unmatched. Then the 
matching process shifts to the next region by skipping the first word and adding the word 
right next to the region to the end. The shifting continues until the matching process finds 
a matching region in a candidate paragraph or the process reaches the end of the 
41 
paragraph. If a region is matched, the skipped words until then form an uncommon words 
region (UWR). 
 
4.2 Matching process 
During matching process, all the pieces of a CWS are compared with candidate sentences 
(paragraphs). If a candidate sentence (paragraph) contains all the pieces of a CWS in 
order, then the candidate is considered as its corresponding matching target. Figure 14 
illustrates the matching process. Regions matching pieces of a CWS is called a common 
words region (CWR) and all other regions of a candidate (or target) are called uncommon 
words region (UWR). Words in UWR are compared with words in a corresponding set of 
different words (DW) list later during selection process, when there are multiple target 
sentences (paragraphs) returned by matching processing. 
Figure 14:
 Matching process 
 
42 
43 
4.3 Experiment 2 
Figure 15 illustrates the setup of experiment 2. In experiment 2, it is tested that how many 
of CWSs can correctly match their target paragraphs after adding common 
misrecognition (CMR) treatments (contraction (CT), stop words removal (SW), 
homophone (HMP) and combined misrecognition (CBMR) treatments) to the experiment 
1 setup. Contraction treatment is added as a later part of normalization and stop words 
removal processes are added right after input analysis of recognition results and right 
after normalization of candidate paragraphs. HMP and CBMR treatment processes are 
added to the matching process. 
 
A stop words removal process for speech recognition results is added after the input 
analysis because it helps preserve more information. If stop words are removed before the 
input analysis, it affects the result of input analysis. More specifically it changes the 
structure of CWSs in some cases. When words in a set of different words (DW) list are 
all stop words, they will all be removed during the removal process. If they are removed 
before the input analysis, the two adjacent pieces of CWSs become one piece because 
there are no words between them. If they are removed after the input analysis, the 
position of the set of DW is still preserved even though all the words in it are removed 
and the two adjacent pieces of CWS stay separated. The result of experiment 2 proves 
that stop words removal performed after input analysis affected about 11% of positive 
matching result. 
 
 
 
Figure 15:
  Experiment 2 setup 
44 
4.3.1 The result of experiment 2 
118 out of 135 CWSs (87.5%) were able to correctly match their target paragraphs and 
none of them matched any other paragraphs as their target. This is 369% improvement 
with respect to the result (32) of experiment 1. 86 more CWSs were able to match their 
target with assistance of advanced CMR treatments (contractions, stop words, 
homophones, and combined misrecognitions).  Their contribution rates are listed in Table 
6. 
 
 
Table 6: Contribution rates of each CMR treatment 
 
The highest contribution rate (about 70%) of stop words treatment proves that speech 
recognition systems greatly suffer from spoken stop words problem, and how they are 
dealt with will significantly improve recognition rates of their systems. The very high 
contribution rate of contraction treatment indicates that contraction problem is very 
common in speech recognition systems and the treatment should be used as one of 
essential processes in speech recognition. And the rates of homophone (HMP) and 
combined misrecognition (CBMR) treatment show that the treatments are very useful 
though their improvement rates are not as great as rates of CT or SW treatments. HMP 
and CBMR treatments can be considered as more positive and advanced process than 
45 
46 
other processes because no information is removed like SW process and incorrect 
information is corrected based on cumulated information. 
 
4.4 Experiment 3 (Overall design of interpreter) 
Figure 16 illustrates the setup of experiment 3. The corpus database and a selection 
process are added to the setup of experiment 2. The corpus database will feed candidate 
paragraphs to the matching process and the selection process will select one target 
paragraph by computing matching rates when matching process returns multiple target 
paragraphs. The candidate paragraphs are supplied to the matching process through 
normalization (+ contraction) treatment and stop words removal processes.  
 
In experiment 3, two sets of paragraph collections will be used. The first one is a 
collection of 2-sentence paragraphs collected from chapter 1 and 2 of [Kiyosaki 2000]. 
First, single sentence are extracted and then too short (< 4) or too long sentences (> 45) 
are excluded. 4 is rounded half of the number of words in the shortest sentence (7) of the 
three target paragraphs and 45 is the sum of the number of words in the longest paragraph 
(35) and the difference between the longest paragraphs (35) and the shortest one (25). 
The order the sentences were written in is preserved because they are all semantically 
related and the written order of the three paragraphs are also preserved. 420 single 
sentences are collected including sentences of the three target paragraphs. They are 
paired up to make 2-sentence paragraphs. The sentence in the order of ?A B C D E G H 
?? forms 2-sentence paragraphs of  ?A+B B+C C+D D+E E+F F+G G+H ??. So 
finally, 839 2-sentence paragraphs are created. The second collection is 1000 2-sentence 
47 
paragraphs collected from [Sykes and MeGregor 2001] in the same way used for the first 
collection of 2-sentence paragraphs. The second collection will be added to the corpus 
database as additional candidate paragraphs in later test. First it will be tested how many 
CWSs can correctly match their targets and how many of them match multiple 
paragraphs, when the targets are mixed with the first corpus (836 paragraphs), and then 
the same experiment will be repeated after adding the second corpus (1000 paragraphs) 
as a paragraph collection from other domain to the first one. Therefore the three target 
paragraphs are mixed with 1836 other paragraphs. 
 
The selection process calculates the matching rates of paragraphs to distinguish the best 
matching target when multiple targets are returned by the matching process. Figure 17 
illustrates the calculation of matching rate. Common misrecognition (CMR) processes are 
also involved in the matching rate calculation of the selection process. When there are 
words unmatched in uncommon words region (UWR) of target paragraphs even after 
word matching, it is checked whether or not the remaining words are homophone (HMP), 
or a part of contraction. If the remaining word is a homophone, the HMP treatment 
process checked if there is a sibling of the homophone in the corresponding set of 
different words (DW) list. If its HMP sibling is found, the remaining word is considered 
matched. If the remaining word is a part of contraction (for example, ?you? is a part of 
?you?ll?) and there is its corresponding contraction in the set of DW list, then it is 
counted as one matched.  
 
 
48 
4.4.1 The result of experiment 3 
The result of experiment 3 using the first corpus is exactly same as the result of 
experiment 2. This means that even though the three target paragraphs were mixed with 
836 other paragraphs, exactly the same 118 CWSs were able to correctly find their target 
and none of them matched any other paragraphs as their target (no multiple targets). 
Therefore the selection process was never used. And the second test after adding the 
second corpus yielded exactly same result as the first test of experiment 3. 
 
Figure 16:
 Experiment 3 setup (Overa
ll design of interpreter) 
49 
Figure 17: 
Matching process 
50 
Figure 18: 
Overall inte
r
p
reta
tion p
r
ogress 
51 
52 
CHAPTER 5 
CONCLUSIONS 
 
For more than a decade, a significant number of research efforts were made to improve 
speech recognition rates by using multiple speech recognizers. The purpose of using 
multiple speech recognizers is to make a best composite output out of several incomplete 
recognition results by complementing each other?s result. Much of the research resulted 
in positive improvement in word and sentence recognition rates but the improved rates 
are still below usable level. 
 
The analysis of the preliminary experiment data of distributed listening revealed that all 
the correct words of a speech may be collected when 4 or slightly more listeners are 
combined, and proved that words agreed by all listeners (speech recognizers) are not 
always correct. About 63% of combined recognition results contain at least one common 
misrecognition. 
 
In this research, several treatment processes were used in order to overcome the common 
misrecognition (CMR) problems in multi-recognizer systems, and a corpus database was 
used as an example of a collection of sentence users may speak in order to overcome low 
sentence recognition rates by searching the best matching sentence in the database.  
53 
The highest contribution rate (about 70%) of stop words treatment proves that speech 
recognition systems greatly suffer from spoken stop words problem, and removing the 
insignificant words can greatly improve speech recognition rates. The very high 
contribution rate (58%) of contraction treatment indicates that the contraction problem is 
very common in speech recognition systems. And the rates (16% and 30%) of 
homophone (HMP) and combined misrecognition (CBMR) treatment show that the 
treatments are very useful. HMP and CBMR treatments are positive and advanced 
process than other processes because no information is removed like SW process and 
incorrect information is corrected based on cumulated information. 
 
The normalization process which removes redundant information and reformats string 
information is very essential, because it enhances accuracy and efficiency of string 
comparison and matching tasks in post-speech recognition processes. The use of simple 
text-corpus, which is a collection of sentences (paragraphs) user may speak, is a very 
crucial part in this research. It was proved that common words & structure (CWS), which 
is a composite of information made out of multiple recognition results, was able to 
correctly match their corresponding target paragraphs even when mixed with about 800 
other paragraphs from the same domain and 1000 paragraphs from other domain. This 
proves that the composite information is enough to distinguish its correct target 
information. Therefore, these results strongly support that the use of a collection of 
sentence users may speak is a good way to overcome low sentence recognition rate in 
speech recognition systems. 
 
54 
The pattern matching approach to combine multiple recognition result was used in this 
research. The approach effectively collected and separated common (CWS) and 
uncommon (Different Words) information out of multiple recognition results. Only 3 
(2.2%) of the CWSs were affected by the combining order of recognition results and the 
changed structure didn?t affect the overall matching result. 
 
In addition to the detective story introduced in distributed listening [Gilbert 2005], CSI 
(Crime Scene Inspector) teams which assist the detective with advanced skills and 
Scenarios built upon based on their collective experience and knowledge can be added to 
the story as analogies of the CMR (common misrecognition) treatments and the simple 
text-corpus. 
 
Mainly normalization, pattern matching, common misrecognition (CMR) treatment and 
simple text-corpus were used to improve sentence recognition rates of multi-recognizer 
systems in this research. Overall the results (87% of sentence recognition rate) of the 
experiments strongly support that the processes can greatly improve speech recognition 
rates of multi-recognizer systems. 
 
55 
BIBLIOGRAPHY 
 
Barry, T., Solz, T., Reising, J. and Williamson, D., The simultaneous use of three 
machine speech recognition systems to increase recognition accuracy, In Proceedings 
of the IEEE 1994 National Aerospace and Electronics Conference, vol.2, pp. 667 - 671, 
1994. 
 
Bray, T.  On Search: Stopwords [Online] Available 
http://www.usa.net/~vinced/home/better-writing.html, July 2003 
 
Celik, M., Teaching English Intonation to EFL/ESL Students [Online] Available 
http://iteslj.org/Techniques/Celik-Intonation.html, The Internet TESL Journal, Vol. VII, 
No. 12, December 2001. 
 
Coopers, A., Alan Cooper?s Homonyms [Online] Available 
http://www.cooper.com/alan/homonym.html, Alan Cooper, 2001 
 
 
 
56 
Cristoforetti, L., Matassoni, M., Omologo, M. and Svaizer, P., Use of parallel 
recognizers for robust in-car speech interaction, In Proceedings of the IEEE 
International Conference on Acoustic, Speech, and Signal Processing [ICASSP 2003], 
Hong-Kong, 2003. 
 
Deng, L. and Huang, X.,  Challenges in adopting speech recognition, Communications 
of the ACM, vol. 47, no. 1, pp. 69-75, January 2004. 
 
Fiscus, JG., A post-processing system to yield reduced word error rates: Recognizer 
Output Voting Error Reduction (ROVER). Proc IEEE Workshop on Automatic 
Speech Recognition and Understanding, p 347?354, 1997. 
 
Gavrilov, L. A. and Heuveline, P., Aging of Population [Online] Available 
http://longevity-science.org/Population_Aging.htm, The encyclopedia of population, 
New York, Macmillan reference USA, 2003. 
 
Gilbert, J. E., Distributed Listening Research. In SpeechTEK West: Proceedings of 
AVIOS Speech Technology Track (pp. 1 ? 10). San Francisco, California, 2005. 
 
Kiyosaki, R. T. and Lechter, S. L., Rich Dad, Poor Dad: What the Rich Teach Their 
Kids about Money - That the Poor and Middle Class Do Not, Warner business books, 
April 2000. 
 
57 
Larsen, L. Bo., Br?ndsted, T., H., Dybkj?r, Dybkj?r, L., Music, B. and C. Povlsen., 
Spoken Language Dialog Systems, Report 1, September 1992. 
Pallett, D. S., Fiscus, J. G., Garofolo, J. S., Martin, A. and Przybocki, M., 1998 
broadcast news benchmark test results: English and non-English word error rate 
performance measures. In DARPA Broadcast NewsWorkshop, Hernon, VA, February 
1999. 
 
Pinker, S., The language instinct (New York: Morrow, 1994) 
 
Schwenk, H. and Gauvain, J., Improved ROVER using Language Model Information, 
In ISCA ITRW Workshop on Automatic Speech Recognition: Challenges for the new 
Millennium, Paris, pp. 47?52, Sept 2000. 
 
Schwenk, H., Gauvain, J., Combining multiple speech recognizers using voting and 
language model information. Proc 6th ICSLP, p 915?918, Oct 2000. 
 
Stevenson, B. and McQuivey, J., The aging of the US population and its impact on 
computer use [Online] Available 
http://www.microsoft.com/enable/research/agingpop.aspx, The Market for Accessible 
Technology?The Wide Range of Abilities and Its Impact on Computer Use, Study 
commissioned by Microsoft, conducted by Forrester research, Inc., 2003.
 
58 
Sullivan, D., What are stop words? [Online] Available 
http://searchenginewatch.com/showPage.html?page=2156061, SearchEngineWatch.com 
? the source for search engine marketing, 2006. 
Sykes, D. A. and McGregor, J. D., Practical guide to testing object-oriented software, 
Addison-Wesley professional, March 2001. 
 
The American Heritage? Book of English Usage, A Practical and Authoritative Guide 
to Contemporary English [Online] Available 
http://www.bartelby.net/64/C001/056.html, 1996. 
 
Wikipedia, the free encyclopedia, Contraction (grammar) [Online] Available 
http://en.wikipedia.org/wiki/Contraction_%28grammar%29, June 2006