USABILITY SIZE N

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

__________________________
Andrea E. Williams

Certificate of Approval:

___________________________
Cheryl Seals
Associate Professor
Computer Science and Software Engineering

___________________________
Juan Gilbert, Chair
Associate Professor
Computer Science and Software Engineering

___________________________
Peter Grandjean
Associate Professor
Health & Human Performance

___________________________
George T. Flowers
Interim Dean
Graduate School

USABILITY SIZE N

Andrea E. Williams

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
August 4, 2007

USABILITY SIZE N

Andrea E. Williams

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

________________________
Signature of Author

________________________
Date of Graduation

VITA

Andrea E. Williams, daughter of James and Erma Williams, was born on February 24, 1983 in Ft. Myers, FL. She graduated from Columbus High School with honors in 2001. She attended Spelman College in the fall of 2001 and received a Bachelor of Science in Computer Science in May 2005. The following fall she attended Auburn University, where she is currently working toward her Ph.D.

THESIS ABSTRACT

USABILITY SIZE N

Andrea E. Williams

Master of Science, August 4, 2007
(B.S., Spelman College, May 2005)

41 Typed Pages

Directed by Juan E. Gilbert

In today's software development environment, building a usable product that satisfies customers is key to the success of a business. User satisfaction and usefulness are measured through usability studies that involve potential customers. In the politics of software development and delivery, however, conducting usability studies can become a costly item in the overall budget. This can cause problems because some managers simply fall back on heuristic evaluations, which are significantly cheaper, using the developer as a tester and leaving out the real user, the customer. By using Applications Quest, a data mining clustering tool, we would like to see whether, given a population of size N, there is a subset of N that would yield the same results as the larger population. If a company could use a smaller subset of N and get the same results, it could possibly stay on budget, stay on schedule, and save money.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank the Lord above because it is through him I am able to do all things. He has guided me through life's winding roads and I am still here. I would like to thank Dr. Juan Gilbert for all his support, encouragement, and patience. At times when I did not know how I was going to make it, he was a guide to the light. I would also like to thank my committee members for their support and words of wisdom. Their efforts in advising me and reviewing my work are forever appreciated. Thank you to my family for their love, support, and encouragement. They have always told me to follow my dreams and every time I have they have been right there behind me believing and pushing. Last but not least I would like to say thank you to my fiancé Justin.
There were times when homework had to come first and he was right there with me googling for answers. At times when I felt alone, he showed up with treats in hand from Conyers, GA. Thank you so much!

Style manual or journal used: Journal of SAMPE
Computer software used: Microsoft Word 2003

TABLE OF CONTENTS

LIST OF FIGURES ............ ix
LIST OF TABLES ............ x
1. INTRODUCTION ............ 1
   1.1 PROBLEM DEFINITION ............ 2
2. LITERATURE REVIEW ............ 3
   2.1 FIVE-USER ASSUMPTION ............ 3
   2.2 FIVE USERS AND BEYOND ............ 4
   2.3 RANDOM SAMPLING ............ 5
   2.4 HOMOGENEOUS VS. HETEROGENEOUS POPULATIONS ............ 6
   2.5 CONCLUSION ............ 7
3. USING APPLICATIONS QUEST AS AN APPROACH TO FINDING N SIZE ............ 9
   3.1 WHAT IS APPLICATIONS QUEST? ............ 9
   3.2 HOW IT WORKS ............ 10
   3.3 DIVISIVE CLUSTERING: A POSSIBLE SOLUTION TO OUR PROBLEM ............ 10
4. EXPERIMENT ............ 11
   4.1 EXPERIMENT DESIGN ............ 11
      4.1.1 DATA ............ 11
      4.1.2 MATERIALS ............ 12
      4.1.3 PROCEDURE ............ 12
   4.2 RESULTS ............ 14
5. CONCLUSIONS ............ 26
   5.1 CONCLUSIONS ............ 26
   5.2 FUTURE WORK ............ 27
REFERENCES ............ 28
APPENDIX A ............ 30

LIST OF FIGURES

Figure 1: Curve showing relationship between problems found and number of users ............ 4
Figure 2: Comparison of statistical differences for group size 5: Random Trial vs. Applications Quest ............ 14
Figure 3: Comparison of statistical differences for group size 7: Random Trial vs. Applications Quest ............ 15
Figure 4: Comparison of statistical differences for group size 13: Random Trial vs. Applications Quest ............ 16
Figure 5: Comparison of statistical differences for group size 15: Random Trial vs. Applications Quest ............ 17
Figure 6: Comparison of statistical differences for group size 20: Random Trial vs. Applications Quest ............ 18
Figure 7: Comparison of statistical difference with revised algorithm for group size 5: Random Trial vs. Applications Quest ............ 19
Figure 8: Comparison of statistical difference with revised algorithm for group size 7: Random Trial vs. Applications Quest ............ 20
Figure 9: Comparison of statistical difference for group size 13: Random Trial vs. Applications Quest with revised algorithm ............ 21
Figure 10: Comparison of statistical difference for group size 15: Random Trial vs. Applications Quest with revised algorithm ............ 22
Figure 11: Comparison of statistical difference for group size 20: Random Trial vs. Applications Quest with revised algorithm ............ 23

LIST OF TABLES

Table 1. Trial results from random selections ............ 30
Table 2. Trial results from Applications Quest (most similar algorithm) ............ 31
Table 3. Trial results from Applications Quest (most different algorithm) ............ 31

1. INTRODUCTION

In the software development cycle, developers often hold usability studies to test the accuracy and effectiveness of their software and to gather user feedback on its satisfaction and usability. In practice, usability studies can give developers insight into the mind of the user as well as unveil errors, major and minor, within the system. Of course, as with anything that involves users and studies, planning and budgeting must be done to assess the cost of usability testing and of the users in the study. Planning studies can be time consuming because activities such as designing the study, enlisting participants, and possibly running the study several times must take place. In planning, developers must consider different usability methods, heuristic evaluation, and observation of the tasks performed in the study; these activities can become burdensome and intimidating to companies that are not familiar with this practice or are not sure which practices will benefit them most. Budgeting for studies within the development cycle is often a tug of war: although several tests might prove beneficial over the life of the project, in the short term the budget might not allow for testing at all, or it might allow only for a single test with a select number of participants.
Often there are numerous problems with planning and budgeting that ultimately cause studies either to be drastically cut down in size or eliminated altogether. Determining the best design factors for a study can be problematic because studies should be designed to fit the particular company, its size, and its goal for the study. Some companies are not familiar enough with design specifications and often have to hire someone to implement their study, neglect it altogether, or, in some cases, implement their own study. In all cases, the outcomes can become costly if proper judgment is not used in selecting the type of study, the number of participants, the type of participants, or even the number of runs (trials) needed for that particular study.

1.1 PROBLEM DEFINITION

The aim of this study is to show that, using the software Applications Quest, one aspect of designing a study, namely the selection of participants, can be done effectively to 1) reduce costs and 2) maintain or improve result quality. The experiment conducted compares the results for two groups, those randomly selected and those chosen by Applications Quest, and evaluates the significant difference of one group over the other. The rest of this thesis is organized as follows: Chapter 1 introduces the problem definition, while Chapter 2 examines past and current methods for selecting the number of participants as well as which participants are chosen from the population of users. Chapter 3 gives a thorough description of Applications Quest and how it is used in the context of this experiment. Chapter 4 discusses the design and analysis of results, and Chapter 5 presents conclusions and ideas for future work.

2. LITERATURE REVIEW

In usability design there is a debate about the number of participants to use in a study: there is the five-user assumption, which says five users is all you need, and there is the counterargument that five users is not nearly enough because they will not provide sufficient feedback about a product. Which idea is correct? In actuality there are a number of theories that each claim to know the right number of users for a study, each asserting that that number will provide a large percentage of accuracy and return most of the defects in the product.

2.1 FIVE-USER ASSUMPTION

As discussed previously, usability studies can become expensive when it comes to designing and selecting users. Nielsen says, "The best results come from testing no more than 5 users and running as many small tests as you can afford." [2] According to the formula

Problems Found(i) = N(1 - (1 - λ)^i),

where N is the total number of usability problems in the design and λ is the proportion of problems a single user uncovers (roughly one third), one user should be able to uncover about a third of the findings, and as more users are added, redundancy occurs in the information. The formula is represented graphically below.

Figure 1: Curve showing relationship between problems found and number of users

Nielsen's study showed that a group of five users was able to find about 80% of the findings in a system, and that as more users were added there was less new information to be found while more and more money was being spent to run tests and compensate additional users. The idea behind the assumption is that you can learn more from a group of five completing multiple tests than you would from fifteen participants completing one test: the study would yield more results and cost the same as or less than the study with fifteen participants. Because of this study by Nielsen, many usability professionals use only four to five users in their studies. [2]
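To make the curve in Figure 1 concrete, the following small calculation (not part of the original thesis) evaluates the formula above, assuming λ ≈ 0.31 as the proportion of problems a single user uncovers; with that value one user finds roughly a third of the problems and five users find roughly 80-85 percent, which is the plateau Nielsen describes.

    # Illustrative only: the five-user assumption curve, assuming lambda ~ 0.31.
    def proportion_found(i, lam=0.31):
        """Expected fraction of the N total problems found by i users."""
        return 1 - (1 - lam) ** i

    for i in range(1, 16):
        print(f"{i:2d} users -> {proportion_found(i):.0%} of problems found")

    # With lambda = 0.31, one user finds ~31% of the problems and five users
    # find ~84%; beyond five, each added participant contributes mostly
    # redundant findings, which is the basis of the five-user assumption.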
2.2 FIVE USERS AND BEYOND

After applying the five-user assumption, many usability professionals have found that five users are not enough. In one study, five randomly chosen users uncovered only 35% of the findings, while the 13th and 18th users uncovered results that the original five missed. Had the study been discontinued at five, those results would have been overlooked. In that study, users 6-18 were able to find new results that the original five could not, which shows that if the right users are not chosen, pertinent results can be left out. [1] In attempting to describe the limits of the five-user assumption, many professionals neglected the rest of the assumption, which recommends running subjects until the findings reach an "acceptable level," and instead adopted the most minimal number, namely five. To further examine the theory that five is not enough, a more structured study took a population of sixty and randomly selected multiple groups of five or more. Each group's findings were then compared against the findings of the entire population to measure how group size affected data reliability, confidence, and the usability issues found. The average percentage of findings across 100 trials of groups of five was 85%, while the percentage for any single random group of five ranged from 55% to 100%. Adding users increased the percentages, but the most important result showed that 55% was the minimum percentage for a group of five, while a group of twenty produced a minimum percentage of 95%. [1]

2.3 RANDOM SAMPLING

Random sampling is another method often used in usability studies. When properly done, random sampling contains no bias and can be reasonably representative of the targeted population. [8] This method is also used because it requires no prior research or skill in selecting participants and is less expensive. Random sampling allows researchers to make generalizations about the majority of the population, and those claims can be justified to a certain level of certainty. [9] Of course, as with any choice surrounding a usability study and the selection of users, companies must choose the methods that best fit their budget and the goal of their study. Samples are chosen in different ways, such as simple random, systematic, weighted (quota), or convenience selection. In simple random selection, participants are chosen from the entire group by randomly selecting a unique identifier, which can be drawn by hand, like a tag from a hat, or selected mathematically by a computer program. Systematic selection divides the population into partitions and randomly selects a participant from each partition. In some studies, particularly web-based studies where companies are trying to target a specific user group, usability professionals give weight to that group to ensure its presence in the sample. A convenience selection is just what it sounds like: the researcher chooses whichever participants can be conveniently found, and those participants may or may not be representative of the targeted group at all. Study results have demonstrated that random sampling can be problematic because you can never be 100% certain that the results from the selected sample are representative of the entire population. [8]
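The selection schemes described above can be sketched roughly as follows. This is an illustrative sketch only, not part of the original thesis; the participant pool, group size, and weights are hypothetical placeholders.

    import random

    def simple_random(pool, k, seed=None):
        """Simple random selection: draw k unique identifiers 'from a hat'."""
        rng = random.Random(seed)
        return rng.sample(pool, k)

    def systematic(pool, k, seed=None):
        """Systematic selection: partition the pool into k slices and draw
        one participant from each slice."""
        rng = random.Random(seed)
        size = len(pool) // k
        return [rng.choice(pool[i * size:(i + 1) * size]) for i in range(k)]

    def weighted(pool, weights, k, seed=None):
        """Weighted (quota-style) selection: a targeted subgroup is given a
        larger weight so its presence in the sample is more likely."""
        rng = random.Random(seed)
        candidates, chosen = list(pool), []
        for _ in range(k):
            pick = rng.choices(candidates, weights=[weights[p] for p in candidates])[0]
            candidates.remove(pick)
            chosen.append(pick)
        return chosen

    participants = [f"P{i:02d}" for i in range(1, 73)]   # hypothetical pool of 72 users
    print(simple_random(participants, 5, seed=1))
    print(systematic(participants, 5, seed=1))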
Random sampling can also give a false sense of security because in some usability studies the goal is not to find significant difference but rather to gain insight into the usefulness of a particular product.

2.4 HOMOGENEOUS VS. HETEROGENEOUS POPULATIONS

In the article "Eight Is Not Enough," Perfetti and Landesman found, after trying to complete testing on an e-commerce site, that the recommendation of four to five users, and no more than eight, was not enough. The first five users alone yielded only 35% of the problems in the system; at that rate it would take 90 tests to uncover the 600 problems in their system. The problem with this study was that the usability professionals tried to apply a concept that did not quite fit their needs. E-commerce websites contain far more complex content than software and simple websites, and they change continuously and incrementally, whereas software only changes between versions, which does not occur as frequently. With that discovery they also found that their users varied just as much as the complexity of their system. Their results showed that a sample group could not be used as a representation of the whole because each user who interacted with the system used it differently. By understanding the type of product they were testing and their users, the authors were able to learn what worked for their system. [5]

2.5 CONCLUSION

When choosing the "right" participants it is imperative that the users be representative of the population your product is trying to reach. [3] As a sample of the entire group, gathering the relevant demographic information can prove helpful in differentiating between the results of individuals in the group. [4] Recruiting these representative participants is yet another time-consuming and costly activity that creates an intimidation factor for would-be usability professionals. Most professionals agree that testing should be done, but some companies simply do not have the capability or experience necessary to pull off small tests, let alone multiple tests involving users, within the deadlines set for the project. On average, a study is said to cost $107 per user depending on location and profession, and that is without a recruiting agency's help. Companies who use recruiting agencies must add additional fees, while other companies must spend approximately 1.15 man-hours recruiting per person recruited. [6] Even after choosing the "right" participants, it is important for practitioners to understand that there are variables within a study over which they have varying degrees of control. The types of participants a usability professional can find, the mission criticality of a system, and the usability issues posing a problem to a system all have a deep impact on the number of users a study needs in order to obtain accurate results. All things considered, there is still no clear-cut way to select the number of participants to use in studies, nor is there a way to select which users should be used or are most representative. A method that could help usability professionals minimize costs and test group sizes while maximizing results would have a significant impact on usability design.

3. USING APPLICATIONS QUEST AS AN APPROACH TO FINDING N SIZE

3.1 WHAT IS APPLICATIONS QUEST?

"Applications Quest is data mining software that clusters admission applications based on holistic comparisons." [7] The idea behind this software came from two landmark court cases, Grutter vs. Bollinger and Gratz vs. Bollinger, in which two students challenged the University of Michigan's admissions policies.
Because of these cases the Supreme Court ruled that diversity could be considered in admissions policies, but race could not be the determining factor for admission. It was determined that applicants' applications should be reviewed holistically and not based on a single attribute such as race or ethnicity. [7] The notion of holistically reviewing an application means considering each and every attribute of the application such that no single attribute weighs heavier than another. For admissions committees, holistically reviewing applications is time consuming and difficult because humans cannot effectively compare attributes subjectively while producing reproducible results. Applications Quest achieves the goal of holistically comparing applications and recommending applicants that represent diversity, with diversity not being defined by race or ethnicity. Because the algorithm compares every application with the same rigor, the results are reproducible and justifiable. [7]

3.2 HOW IT WORKS

Incorporating computer science and information retrieval clustering algorithms, Applications Quest holistically compares thousands of applications to one another and places them in groups, or clusters, based on their holistic similarity. [7] The algorithm uses attribute-value pairs to compare the applications; the more values two applications have in common, the more likely they are to be placed in the same cluster, so similar applications appear together. With diversity in mind, each cluster is designed to hold similar applications, but from each cluster the most different applicant is chosen.

3.3 DIVISIVE CLUSTERING: A POSSIBLE SOLUTION TO OUR PROBLEM

Employing a divisive clustering algorithm, Applications Quest recommends applicants that are representative of the diversity within an admissions applicant pool. Using this same software but modifying the context in which it is used, namely for participant selection in usability studies, could help usability professionals select the most representative users from their targeted population. With the most representative test users selected by Applications Quest, usability professionals can save money and time on recruiting and on weeding out studies completed by outliers in the group. The idea is that users selected by Applications Quest will yield the same, if not better, results as the entire population of potential test users. Applications Quest would offer a solution with minimal cost, reproducible recommendations, and quality results.

4. EXPERIMENT

An experiment was conducted to determine whether, given a usability group of size N, Applications Quest could select a subset of users whose study results would be representative of the population. To be considered representative, a group's results had to prove insignificantly different from those of the overall population.

4.1 EXPERIMENT DESIGN

4.1.1 DATA

The data for this experiment was selected from a previous study done in the Human Centered Computing Lab at Auburn University. The seventy-two users in the study were students from Science, Technology, Engineering, and Mathematics (STEM) majors. Their ages ranged from 19 to 30 years. Of the seventy-two participants, seventy spoke English as their native language while the other two spoke English as a second language. There were twenty-one females and fifty-one males.
In the original study, the users' demographic data was collected in pre-surveys and their answers to the questionnaire about the software they used were collected in post-surveys.

4.1.2 MATERIALS

This experiment's results were stored and manipulated in Microsoft Access and Microsoft Excel.

4.1.3 PROCEDURE

The data discussed above was imported into an Access database. To clean the data, participants were filtered to make sure that every pre-survey had an accompanying post-survey. The experiment was conducted in two parts, with the second part being done in two different ways:

1) Randomly selected participants were chosen from the group of seventy-two users. Each participant had been given a unique identifier in the original study, and that identifier was used in this study as well to maintain anonymity. To select users in this approach, a program was written to randomly select groups of participants using the current time divided by the total population size as the random seed. The group sizes selected were 5, 7, 13, 15, and 20. For each group size, five random trials were run, meaning that for each trial new participants were randomly chosen. Each selected participant's answers to questions chosen from the questionnaire were queried and placed in an Excel spreadsheet where they could be tested for significant difference from the entire population. The attributes chosen from the questionnaire, wonderful---terrible, frustrating---satisfying, usable---not usable, and "this medium was easy for me to use," were based on a 5-point Likert scale. Significant difference was tested for each group using Microsoft Excel's t-test formula. The t-test is an analysis tool that tests whether the underlying samples come from populations with equal means. The t-test can employ three different assumptions, but for this experiment the two-sample unequal variance assumption was used. Two-sample unequal variance means it is assumed that the two samples come from distributions with unequal variances, and the test determines whether those distributions have equal population means. Once calculated, the results of the t-test were analyzed to see whether the randomly selected group could be considered representative of the population.

2a) All pre-survey demographic data was loaded into a database and run through Applications Quest. Applications Quest was given a specified number of clusters to return, and from each of those clusters it chose the most representative person. Once the participants were chosen, their answers to the selected questionnaire questions were queried and placed in an Excel spreadsheet where they could be tested for significant difference from the entire population. The same attributes from the first part of the experiment were used here as well. Significant difference was again tested with the t-test and the results analyzed for comparison.

2b) The same data loaded into the database for part 2a was used to run Applications Quest again; however, the algorithm for part 2b was changed to select the most different person from each cluster. In the case where a cluster contained only two participants, Applications Quest would select the participant most different from the entire population. Again the same attributes were used for querying, and the results were tested and analyzed with the t-test to determine significant difference.
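The procedure above can be summarized in a short sketch. This is not the original Access/Excel implementation: the ratings table is filled with made-up Likert scores, the attribute names are abbreviations of the questionnaire items, and SciPy's Welch t-test stands in for Excel's two-sample unequal-variance t-test, which it matches in intent.

    import random
    import time
    from scipy.stats import ttest_ind   # Welch t-test when equal_var=False

    ATTRIBUTES = ["wonderful---terrible", "frustrating---satisfying",
                  "usable---not usable", "medium was easy to use"]

    # Hypothetical post-survey data: participant id -> {attribute: 1-5 Likert rating}
    ratings = {pid: {a: random.randint(1, 5) for a in ATTRIBUTES} for pid in range(1, 73)}

    def random_group(population, size):
        """Part 1: random selection, seeded with time divided by population size,
        as described in the procedure."""
        rng = random.Random(time.time() / len(population))
        return rng.sample(sorted(population), size)

    def compare_to_population(group_ids, alpha=0.10):
        """Welch t-test of the group's ratings vs. the whole population, per attribute.
        p < alpha (the 90 percent confidence level) is treated as significantly different."""
        results = {}
        for attr in ATTRIBUTES:
            group_scores = [ratings[p][attr] for p in group_ids]
            population_scores = [r[attr] for r in ratings.values()]
            _, p_value = ttest_ind(group_scores, population_scores, equal_var=False)
            results[attr] = (p_value, p_value < alpha)
        return results

    for attr, (p, sig) in compare_to_population(random_group(ratings, 5)).items():
        print(f"{attr}: p = {p:.4f} -> {'significantly different' if sig else 'representative'}")

For approaches 2a and 2b, the group identifiers would instead come from the Applications Quest cluster recommendations (most similar or most different person per cluster), with the same per-attribute comparison applied.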
4.2 RESULTS

Once the statistical analysis tools had been applied, as described for both approaches, to each group size chosen for this experiment, the results were as follows.

Figure 2: Comparison of statistical differences for group size 5: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

As seen in Figure 2, the random trials for group size five were very inconsistent; the results varied considerably from one trial to the next. In Figure 2, 60 percent of the random trials for group size five were found significantly different for the attribute wonderful---terrible, meaning that their p-values fell below .10 and did not meet the 90 percent confidence level set as acceptable for this experiment. This suggests that a usability professional has a 40 percent chance of randomly selecting five participants representative of the targeted population. For the attributes wonderful---terrible and frustrating---satisfying in this figure, Applications Quest performed better than four of the five random trials; on the fifth random trial it was equal for wonderful---terrible and slightly behind for frustrating---satisfying. As the remaining figures show, Applications Quest maintains its accuracy and confidence.

Figure 3: Comparison of statistical differences for group size 7: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

In Figure 3, trials rand-(1), rand-(5), and Applications Quest each produced all insignificantly different attributes. The p-values for those random trials were able to surpass those of Applications Quest, but the probability of such trials being selected was only 40 percent. The other random trials were able to generate attributes with insignificant difference, but they still maintained a level of inconsistency.

Figure 4: Comparison of statistical differences for group size 13: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

In Figure 4, 80 percent of the random trials produced insignificantly different results, while Applications Quest was only able to produce two attributes that were insignificantly different. The random trials clearly outperformed Applications Quest, but the size of the group makes the results questionable. The group size represents approximately 20 percent of the total populace, and the previous trials have shown that there exists a smaller subset of participants that can yield similar results.

Figure 5: Comparison of statistical differences for group size 15: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

Figure 5 shows that as more participants were randomly selected, the results for the random trials improved across the board. Applications Quest, throughout each of the trials, maintained a steady level of consistency by matching the results or performing better.
Although the random trials present a high level of confidence, the number of participants is steadily increasing, and so would the price of user testing.

Figure 6: Comparison of statistical differences for group size 20: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

Again in Figure 6, the results improved as more participants were selected randomly. From the graphs above it can be seen that Applications Quest, although predicted to outperform the random trials, was only able to match the results and in some cases performed worse than expected across the board. These facts at the onset seemed disheartening, but with further investigation of the data it can also be seen that, when individual attributes from the study are considered, Applications Quest demonstrated that it could select participants whose post-survey results were insignificantly different from those of the population. For example, in all of the above figures Applications Quest was able to select participants in every trial that were representative of the population for the attributes wonderful---terrible and frustrating---satisfying. Because the results of Applications Quest in comparison with the random results initially seemed unaligned with this experiment's hypothesis, approach 2b (Applications Quest with a revised algorithm) was designed, and its results are as follows:

Figure 7: Comparison of statistical difference with revised algorithm for group size 5: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

Figure 7 illustrates that even with a change in the Applications Quest algorithm, it was still able to produce exemplary results in comparison to the random trials. Random trial one was able to generate all attributes that were insignificantly different, but the subsequent trials still demonstrated highly variable results. When broken down into individual attributes, the attributes usable---not usable and frustrating---satisfying were consistently above the 90 percent confidence threshold for all trials, random and Applications Quest. The Applications Quest trial was able to produce three attributes that were insignificantly different and whose p-values were large enough to support a high level of confidence in the participants selected.

Figure 8: Comparison of statistical difference with revised algorithm for group size 7: Random Trial vs. Applications Quest
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

In Figure 8, Applications Quest was able to produce all attributes with insignificant difference. Random trials one and three also successfully generated a complete set of attributes insignificantly different from the population. An interesting fact revealed by this figure and Figure 3 is that Applications Quest with group size 7 was able to produce complete sets of insignificantly different attributes with both versions of its algorithm; no other group size from the Applications Quest trials was able to demonstrate this.
In this graph it can also be seen that the attribute usable---not usable was found insignificantly different in both the Applications Quest trial and the random trials.

Figure 9: Comparison of statistical difference for group size 13: Random Trial vs. Applications Quest with revised algorithm
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

In Figure 9, 80 percent of the random trials produced insignificantly different results, while Applications Quest was only able to generate two insignificant attributes. Frustrating---satisfying was the only attribute to consistently prove insignificant across each trial. Random trial three was able to provide a high level of confidence for each attribute, but the point remains throughout this experiment that the goal is to find the minimum number of participants that provide the same confidence level or higher.

Figure 10: Comparison of statistical difference for group size 15: Random Trial vs. Applications Quest with revised algorithm
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

Figure 10 continues to demonstrate that the revised algorithm for Applications Quest can produce some attributes with insignificant difference but does not match the proficiency of the original algorithm. The graph shows that the random trials were superior in selecting participants as a complete set, but in the attribute breakdown, Applications Quest selected participants with a higher level of confidence for the attributes frustrating---satisfying and usable---not usable. The increased level of confidence in the individual attributes suggests that as more participants are selected they represent a larger portion of the population, the population representing the targeted product audience; although the population is assumed to be somewhat similar, some difference should also exist. In the case of this experiment, the data used can be considered mildly homogeneous, hence the increase in confidence.

Figure 11: Comparison of statistical difference for group size 20: Random Trial vs. Applications Quest with revised algorithm
[Bar chart: p-values (y-axis) by experiment method (x-axis) for the four questionnaire attributes]

In Figure 11, Applications Quest produced two attributes whose results were insignificantly different. Throughout this part of the experiment (approach 2b), Applications Quest did exceedingly well in selecting those two attributes, but across the board the random trials proved much better than Applications Quest. Eighty percent of the random trials successfully selected participants representative of the population; this appears satisfactory, but in the grand scheme of budgeting this number of participants may be too large.

Although the revised algorithm for Applications Quest did not, as hoped, return all groups of users representative of the populace, meaning that for every trial there were no groups significantly different, it did present some other interesting results. Reviewing Figures 7-11, one can see that the smaller groups' results were more representative than the random trials, percentage-wise.
Because the goal is to find the lowest number of participants representative of the population with a high degree of certainty, Applications Quest outperformed the random samples. Another interesting finding was that group size 7 was the only group size returned by Applications Quest that was found 100 percent insignificantly different in both approaches. This suggests that a usability professional designing a study could be 100 percent certain that the group selected by Applications Quest would provide the same results as the full population of seventy-two participants. With this certainty, the professional could use the seven users instead of the seventy-two, save money on user testing, stay on budget for the usability design portion, and even stay on schedule for the time allotted to testing. In the random trials the results were promising, but the problem was that the trials were too unpredictable. In Applications Quest, the algorithm for selection is the same every time: choose the participant that is the most similar or the most different, depending on the algorithm used. The random trials returned the larger groups as most representative, which becomes a problem because the idea is to find the minimum number of participants. Group sizes 13, 15, and 20 had very high percentages of trials that were insignificantly different, but that does not say much because those groups are almost a fourth of the populace.

In comparison, the Applications Quest algorithm that selects the most similar participant was more effective, selecting a better percentage of representative groups. Also, in the attribute breakdown, the most similar algorithm returned more attributes with 100 percent certainty of insignificant difference. These facts suggest that although both algorithms performed in close proximity, choosing which algorithm to use comes down to the goals of the study. If the study aims to find users that are most representative of the targeted population, the most similar algorithm should be used. If the study aims to find the most diverse yet still similar users, the most different algorithm should be used.

5. CONCLUSIONS

5.1 CONCLUSIONS

The goal of a usability study is to work out the kinks of software and improve user satisfaction. When the job of designing a study and finding test users becomes too taxing, many designers and researchers abandon user testing and simply rely on heuristic evaluation. Those researchers who take on the task often find that recruiting users is daunting and time-consuming. The aim of this experiment was to find a plausible approach to selecting participants for studies by using Applications Quest. Applications Quest would take a group of size N and from that group select participants that would be representative of the population. The selected participants would help reduce costs by minimizing the number of participants necessary while still maintaining result quality. Two approaches were used for comparison, random selection and Applications Quest selection. Although the random trials in this experiment were able to compete with the results of Applications Quest, Applications Quest was able to present results that were insignificantly different as well as consistently reproducible. The random trials were unpredictable, and that fact does not lend itself to guaranteeing certainty or reliability when that is necessary in selecting participants.
Upon comparing Applications Quest against itself after revising its original algorithm, the original version (which selects the most representative user) performed more effectively than the algorithm selecting the most different user. These findings suggest that Applications Quest could very well be a promising solution to the issue of user recruiting and selection. The results are reproducible and consistent, and with more experimentation it could guarantee a higher percentage of insignificant difference.

5.2 FUTURE WORK

As stated previously and as seen in the graphs, the random trials of this experiment were able to contend with Applications Quest. With further research, we would like to see if Applications Quest can eliminate the competition. We would like to run the same experiment on less homogeneous data as well as on larger datasets. On a less homogeneous, larger distribution of data we may find that Applications Quest can select the most representative participants more efficiently, because a larger distribution lends itself to comparing less dense clusters. We would also like to try this experiment with more demographic data. If we could feed different sets of demographic data on human subjects into Applications Quest, we might see a trend in what information usability professionals could use in recruiting participants.

REFERENCES

1. Faulkner, Laura. (2003) Beyond the five-user assumption: Benefits of increased sample sizes in usability testing, Behavior Research Methods, Instruments, & Computers. Retrieved March 9, 2007, from www.geocities.com/FaulknerUsability/Faulkner_BRMIC_Vol35.pdf
2. Nielsen, Jakob. (2000, March) Why You Only Need to Test With 5 Users, Alertbox. Retrieved March 9, 2007, from www.useit.com/alertbox/20000319.html
3. Heim, Steven. The Resonant Interface: HCI Foundations for Interaction Design, Pearson Addison Wesley, (2008).
4. Carroll, John M., Rosson, Mary Beth. Usability Engineering: Scenario-Based Development of Human-Computer Interaction, Morgan Kaufmann Publishers, (2002).
5. Landesman, Lori, Perfetti, Christine. (2001, June) Eight Is Not Enough, User Interface Engineering. Retrieved March 29, 2007, from www.uie.com/articles/eight_is_not_enough/
6. Nielsen, Jakob. (2003, January) Recruiting Test Participants for Usability Studies, Alertbox. Retrieved March 9, 2007, from http://www.useit.com/alertbox/20030120.html
7. Gilbert, Juan. (2004) Applications Quest. Retrieved March 29, 2007, from www.applicationsquest.com
8. (2005, February) Arteology: Sampling. Retrieved March 27, 2007, from www.uiah.fi/projects/metodi/152.htm
9. Rosenstein, Aviva. (2001, September) Managing Risk with Usability Testing, Classic System Solutions. Retrieved March 29, 2007, from http://www.classicsys.com/css06/cfm/articlesusability.cfm

APPENDIX A
Table 1. Trial results from random selections (p-values)

Trial         Terrible-Wonderful   Frustrating-Satisfying   Usable-Not Usable   Easy medium
rand-5(1)     0.2597               0.2255                   0.1941              0.2125
rand-5(2)     0.0058               0.0238                   0.4758              0.0066
rand-5(3)     0.3016               0.0414                   0.2065              0.1631
rand-5(4)     0.0686               0.2255                   0.2856              0.0073
rand-5(5)     0.0707               0.3387                   0.2593              0.0837
rand-7(1)     0.2265               0.4155                   0.3289              0.1511
rand-7(2)     0.0953               0.1658                   0.2300              0.3665
rand-7(3)     0.4797               0.0506                   0.3652              0.3308
rand-7(4)     0.1928               0.0840                   0.1462              0.0980
rand-7(5)     0.1791               0.3075                   0.3289              0.2157
rand-13(1)    0.3089               0.2958                   0.1721              0.1610
rand-13(2)    0.2724               0.2246                   0.0703              0.4017
rand-13(3)    0.3261               0.3175                   0.3598              0.4957
rand-13(4)    0.2071               0.2167                   0.2485              0.1228
rand-13(5)    0.1265               0.4188                   0.3874              0.2883
rand-15(1)    0.1759               0.2953                   0.1094              0.4735
rand-15(2)    0.4674               0.4276                   0.3413              0.0362
rand-15(3)    0.4706               0.4259                   0.3587              0.4798
rand-15(4)    0.2138               0.1276                   0.2983              0.3749
rand-15(5)    0.1759               0.1843                   0.0754              0.4762
rand-20(1)    0.0892               0.0254                   0.3141              0.0656
rand-20(2)    0.2292               0.1747                   0.2366              0.2873
rand-20(3)    0.2677               0.4526                   0.4934              0.4857
rand-20(4)    0.4689               0.2289                   0.2126              0.2807
rand-20(5)    0.2471               0.4069                   0.2423              0.3557

Table 2. Trial results from Applications Quest (most similar algorithm) (p-values)

Trial           Terrible-Wonderful   Frustrating-Satisfying   Usable-Not Usable   Easy medium
appsquest-5     0.3016               0.3014                   0.0101              0.0066
appsquest-7     0.1794               0.3439                   0.1118              0.1062
appsquest-13    0.4269               0.1831                   0.0489              0.0636
appsquest-15    0.4108               0.3090                   0.4537              0.0597
appsquest-20    0.4296               0.3143                   0.4295              0.0480

Table 3. Trial results from Applications Quest (most different algorithm) (p-values)

Trial           Terrible-Wonderful   Frustrating-Satisfying   Usable-Not Usable   Easy medium
appsquest-5     0.4762               0.0144                   0.4295              0.4473
appsquest-7     0.2756               0.4808                   0.2611              0.1420
appsquest-13    0.0740               0.4059                   0.3532              0.0630
appsquest-15    0.0634               0.4611                   0.4502              0.0436
appsquest-20    0.0407               0.4125                   0.3807              0.0583
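As a small reader's check, not part of the original analysis, the p-values in Table 1 can be screened against the 0.10 threshold used in Chapter 4 to count how many random trials per group size were insignificantly different on every attribute; only the group-size-5 rows are typed out below, and the remaining rows would be entered the same way.

    # Screen Table 1 p-values against the 0.10 (90 percent confidence) threshold.
    table1 = {
        "rand-5(1)": [0.2597, 0.2255, 0.1941, 0.2125],
        "rand-5(2)": [0.0058, 0.0238, 0.4758, 0.0066],
        "rand-5(3)": [0.3016, 0.0414, 0.2065, 0.1631],
        "rand-5(4)": [0.0686, 0.2255, 0.2856, 0.0073],
        "rand-5(5)": [0.0707, 0.3387, 0.2593, 0.0837],
        # remaining rows of Table 1 omitted here for brevity
    }

    def representative(pvalues, alpha=0.10):
        """A trial counts as representative if every attribute's p-value is at or above alpha."""
        return all(p >= alpha for p in pvalues)

    by_size = {}
    for trial, pvals in table1.items():
        size = trial.split("(")[0]            # e.g. "rand-5"
        by_size.setdefault(size, []).append(representative(pvals))

    for size, flags in sorted(by_size.items()):
        print(f"{size}: {sum(flags)}/{len(flags)} trials representative of the population")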