TRANSIENT ERROR AND COEFFICIENT ALPHA: A CALL FOR CAUTIOUS PRACTICE WHEN APPLYING AND INTERPRETING ALPHA IN PERSONNEL SELECTION SETTINGS

Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This does not include proprietary or classified information.

Christopher S. Winkelspecht

Certificate of Approval:

Philip Lewis, Professor, Psychology
Adrian Thomas, Chair, Associate Professor, Psychology
John G. Veres, III, Professor, Psychology
Frank Weathers, Associate Professor, Psychology
Joe F. Pittman, Interim Dean, Graduate School

TRANSIENT ERROR AND COEFFICIENT ALPHA: A CALL FOR CAUTIOUS PRACTICE WHEN APPLYING AND INTERPRETING ALPHA IN PERSONNEL SELECTION SETTINGS

Christopher S. Winkelspecht

A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 15, 2006

Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author

Date of Graduation

DISSERTATION ABSTRACT

TRANSIENT ERROR AND COEFFICIENT ALPHA: A CALL FOR CAUTIOUS PRACTICE WHEN APPLYING AND INTERPRETING ALPHA IN PERSONNEL SELECTION SETTINGS

Christopher S. Winkelspecht

Doctor of Philosophy, December 15, 2006
(M.S., Auburn University, 2003)
(B.A., Temple University, 1999)

91 Typed Pages

Directed by Adrian Thomas

Reliability is an integral component in determining the worth of results from any measure. There are a number of estimates used to represent reliability, which vary in terms of the sources of error addressed, underlying assumptions about the data, statistical theory, and formulae applied; but in the areas of personnel selection research and practice, coefficient alpha (also known as Cronbach's alpha and simply referred to as alpha below) is by far the most widely reported. Alpha's popularity is mostly due to two commonly accepted properties of the statistic. First, it is a measure of internal consistency, so, unlike test-retest or inter-rater estimates of reliability, the data used to generate the coefficient can be gathered from a single test administration. Second, alpha is generally considered a conservative statistic; more specifically, the coefficient is thought to estimate reliability's lower boundary. While the convenience of calculating alpha is inarguable, the assumption that it is an underestimate of reliability is not always warranted. It has recently been demonstrated that transient errors, a source of variability not often assessed, can actually inflate the alpha coefficient and cause reliability to be overestimated. The current study investigates the effect of transient error and echoes the call to present additional diagnostic information that has recently been introduced to the professional literature, such as Alpha's Standard Error (ASE; Duhachek & Iacobucci, 2004). The benefits of calculating confidence intervals surrounding the alpha coefficient and substituting the upper and lower boundaries in place of the point estimate when performing a variety of calculations used in personnel selection practice and research are demonstrated.
The present research calls for greater caution in interpreting and applying reliability estimates in this high-stakes setting.

ACKNOWLEDGEMENTS

The author would like to share his gratitude for those who helped make this work possible. Many thanks are due to Dr. Adrian Thomas for his countless hours and contributions as chair of this dissertation committee. Thank you, Dr. John G. Veres, III, for all of the knowledge and experiences you have shared. Dr. Frank Weathers, thank you for your continued support and service on both my thesis and dissertation committees. Thank you, Dr. Allison Jones-Farmer, for the expertise and feedback you have provided. Dr. Philip Lewis, thank you for all of the support you have given me and the entire Industrial/Organizational psychology program at Auburn. Both this research and I have greatly benefited from the support and assistance this committee has provided. I am greatly appreciative. Special thanks to my mother, Susan, for her lifelong love and devotion, without which I would never have been able to pursue this degree. Likewise, thanks to my sister, Kathleen, who always cheers me on and up. Thanks to my wonderful wife, Cami, for the love, patience, and encouragement she constantly provides. Finally, I would also like to extend my gratitude to all my family and friends who have provided boundless support throughout my tenure as a graduate student. So many people have helped me in so many ways. I am neither able to thank you all here nor thank you all enough, but I deeply appreciate your generous support.

Style manual or journal used: Publication Manual of the American Psychological Association, 5th edition

Computer software used: Microsoft Word 2003, Statistical Package for the Social Sciences (SPSS) 11.5, Microsoft Excel 2003, and PROC MULTEVENT.

TABLE OF CONTENTS

I. LIST OF TABLES ... x
II. LIST OF FIGURES ... xi
III. INTRODUCTION ... 1
    Personnel Selection and Test Reliability ... 2
    Reliability Estimates ... 3
    The Internal Consistency Method and Coefficient Alpha ... 5
    Assumptions of Coefficient Alpha ... 7
    Transient Error ... 10
    Test Score Banding ... 17
    Using α_LB to Set the Bandwidth ... 20
    Confidence in Test Results ... 21
    Using α_LB in Personnel Selection ... 24
IV. METHODOLOGY ... 29
    Sample ... 29
    Measure ... 29
    Analyses ... 33
V. RESULTS ... 35
    Test-retest Alpha ... 35
    Alpha's Standard Error and Confidence Interval ... 36
    Alpha and the Spearman-Brown Prophecy Formula ... 36
    Alpha and Correction for Attenuation ... 38
    Alpha and Test Score Banding ... 39
VI. DISCUSSION ... 43
    Test-retest Alpha ... 44
    Confidence Interval Alpha ... 45
    Utilizing the Confidence Interval ... 47
    Correction for Attenuation ... 50
    SED Banding ... 51
    Judging the Use of α_LB as a Professional Practice ... 56
    Conclusion ... 59
REFERENCES ... 61
APPENDICES ... 68
    Appendix A ... 69
    Appendix B ... 70
    Appendix C ... 72
    Appendix D ... 74

LIST OF TABLES

Table 1: Racial Differences Using Different Selection Techniques ... 77
Table 2: Hypothetical Score Distribution and Test Score Use ... 78
Table 3: Outcome of Spearman-Brown Prophecy Formula Using α and α_LB ... 27
Table 4: Outcome of Correction for Attenuation Formula Using α and α_LB ... 28
Table 5: KSAs and Testing Modalities for Sergeant Selection Procedure ... 79
Table 6: Descriptive Statistics for 1999 and 2001 Administrations ... 35
Table 7: SBPF Using Traditional α and α_LB ... 37
Table 8: SBPF Using Traditional α and α_LB (solving for i) ... 37
Table 9: Correction for Attenuation Using Traditional α, α_LB, and α_UB ... 39
Table 10: SED Bands Using α and α_LB ... 40
Table 11: Racial Composition of Selected Test-takers by Selection Ratio ... 41
Table 12: Adverse Impact Calculations by Selection Ratio ... 42

LIST OF FIGURES

Figure 1: Item Covariance Matrix for Test-retest Data ... 80
Figure 2: Confidence Interval and Other Reliability Estimates ... 36

INTRODUCTION

Reliability is an integral component in determining the worth of results from any measure. There are a number of estimates used to represent reliability, which vary in terms of the sources of error addressed, underlying assumptions about the data, statistical theory, and formulae applied; but in the areas of personnel selection research and practice, coefficient alpha (also known as Cronbach's alpha and simply referred to as alpha below) is by far the most widely reported. Alpha's popularity is mostly due to two commonly accepted properties of the statistic. First, it is a measure of internal consistency, so, unlike test-retest or inter-rater estimates of reliability, the data used to generate the coefficient can be gathered from a single test administration. Second, alpha is generally considered a conservative statistic; more specifically, the coefficient is thought to estimate reliability's lower boundary. While the convenience of calculating alpha is inarguable, the assumption that it is an underestimate of reliability is not always warranted. It has recently been demonstrated that transient errors, a source of variability not often assessed, can actually inflate the alpha coefficient and cause reliability to be overestimated. The current study investigates the impact of transient error and echoes the call to present additional diagnostic information that has recently been introduced to the professional literature, such as Alpha's Standard Error (ASE; Duhachek & Iacobucci, 2004). The benefits of calculating confidence intervals surrounding the alpha coefficient and substituting the upper and lower boundaries in place of the point estimate when performing a variety of calculations used in personnel selection practice and research are demonstrated. The present research calls for greater caution in interpreting and applying reliability estimates in this high-stakes setting.

Personnel Selection and Test Reliability

Personnel selection is the process through which individuals are identified and hired to fill vacancies within an organization. In order to comply with federal regulations, organizations create selection systems in accordance with various laws (e.g., Title I of the Civil Rights Act of 1991; the Americans with Disabilities Act, 1990) and following professional guidelines (e.g., the Uniform Guidelines on Employee Selection Procedures, 1978; the Principles for the Validation and Use of Personnel Selection Procedures, 2003). The ultimate goal of a personnel selection system is to hire individuals who will provide a return on the organization's investments in them. A considerable amount of time, money, and other resources is spent recruiting, selecting, training, and compensating individuals to perform services for the organization. An organization's selection procedures help identify the people who have the knowledge, skills, and abilities (KSAs) important to successfully perform the duties of the positions being filled (either immediately or after training).
There are numerous selection devices that can be used to assess applicants' KSAs, but whatever the technique (e.g., written exam, structured oral interview, role play), the substance of the assessment (i.e., the content and results) must be valid and reliable. A valid assessment measures what it is intended to measure. If an interview is designed to gauge applicants' management capability, the content should focus on general management issues and not involve other areas of knowledge, such as finance, or other abilities, such as mathematical aptitude. The results should also correlate with job-related variables such as training proficiency and/or job performance. Furthermore, it is critically important that results be reliable, such that if an identical assessment were to be administered the applicants' performances would be replicated. Unfortunately, no test is perfect; irrelevant factors are certain to influence results, and thus assessments can be neither perfectly valid nor perfectly reliable. Irrelevant factors stem from a wide variety of sources, ranging from internal elements of the test (such as poorly worded questions) to external elements of the environment (such as the degree to which others are observing the test-taker). Any source of influence unrelated to the area of knowledge, skill, or ability being assessed is considered measurement error. Measurement error confounds the interpretation of test results and can significantly decrease an organization's ability to choose the most highly qualified individual(s) from among less desirable applicants. The ability to detect the presence of measurement error stems from reliability research within the area of psychometric and statistical theory.

Reliability Estimates

Statistical formulae used to assess the reliability of a measure have been generated on the basis of several theoretical models, but the most influential has been the Classical True Score Model, also more simply known as Classical Test Theory (CTT). CTT's origins lie in Spearman's work with the correlation coefficient in the early 1900s. In a series of publications from 1904 to 1913, Spearman presented logical and mathematical proofs that scores from any test are, to a certain degree, inaccurate measures of whatever trait is being assessed (Crocker & Algina, 1986). The basis of the CTT model rests on Spearman's assertion that any test score can, and should, be considered the composite of two hypothetical elements: a true score and some quantity of random error.

$$X_o = T_x + E_x \quad \text{(Equation 1)}$$

Based on this model (where X_o represents the observed score, T_x is the true score, and E_x represents measurement error), reliability coefficients are formulated to represent the extent to which examinees' scores on a test covary with the "true" extent to which they possess the knowledge, skills, or abilities being tested. The correlation between observed and true scores is estimated by taking the square root of the reliability estimate:

$$r_{XT} = \sqrt{r_{xx'}} \quad \text{(Equation 2)}$$

For example, if .85 is found to be the reliability coefficient for a measure of conscientiousness, this suggests that individuals' observed scores are linearly correlated with their true scores at .92 (√.85 = .92).
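To make the true score model concrete, the brief simulation below generates scores according to Equation 1 and verifies the relation in Equation 2. It is only a sketch; the variance values are arbitrary choices for illustration and are not quantities taken from the present study.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Equation 1: each observed score is a true score plus random error.
    true = rng.normal(loc=50, scale=np.sqrt(8.5), size=n)   # true-score variance = 8.5
    error = rng.normal(loc=0, scale=np.sqrt(1.5), size=n)   # error variance = 1.5
    observed = true + error

    # CTT reliability: true variance over observed variance (8.5/10 = .85).
    reliability = true.var() / observed.var()

    # Equation 2: the observed-true correlation equals the square root of reliability.
    r_xt = np.corrcoef(observed, true)[0, 1]
    print(round(reliability, 2), round(r_xt, 2), round(np.sqrt(reliability), 2))
    # Prints approximately: 0.85 0.92 0.92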
There are other theories of reliability, such as Generalizability Theory and Domain Sampling, which propose models with slightly different emphases, both theoretical and mathematical. The convergences and contrasts among the different theories and their applications are beyond the scope of the current work, but it should be kept in mind that the discussion below would be presented somewhat differently if a model other than CTT served as the theoretical basis. Even within CTT there are numerous ways to estimate the reliability of scores (e.g., internal consistency, test-retest, form equivalence; Thompson, 2003). The present work will specifically focus on variants of the parallel test approach under CTT (namely, the internal consistency method).

Creating parallel tests is a general strategy for assessing the stability of an individual attribute. The method is to develop two forms of a test that produce consistent (with respect to rank-order) true scores for people on both the first and second version (parallel tests). After individuals complete both forms, their scores can be compared and any differences can be attributed to measurement error. Great rank-order differences between scores on the two tests should raise concerns about measurement error, while consistent rank-order should foster confidence in the stability of the results. Strictly parallel tests are extremely difficult to create, and so this approach mostly serves as a conceptual foundation for more practical methods in research and practice. These methods include the test-retest correlation and various derivations of the internal consistency method. However, applied personnel practitioners rarely have the data to investigate test-retest correlations, so the focus of the present work will center on the internal consistency method.

The Internal Consistency Method and Coefficient Alpha

The internal consistency method offers a fairly simple answer to the question "what is reliability?" if each item on a test is considered a distinct behavioral observation: the more observations one takes and the greater the consistency among those observations, the higher the reliability estimate (Murphy & Davidshofer, 2001). Internal consistency estimates are dependent upon the number of observations reported (i.e., items) and the extent to which those observations covary from instance to instance (i.e., the intercorrelations among those items). Coefficient alpha is the most commonly generated statistic of the internal consistency method.

Cronbach (1951) originally introduced coefficient alpha as an extension of one of Kuder and Richardson's (1937) reliability estimates, known as Kuder-Richardson #20 (or KR20). Kuder and Richardson presented several formulae designed to summarize the reliability of a test composed of multiple dichotomously scored items (i.e., 0 = incorrect and 1 = correct; Knapp, 1991). The best known and most commonly referenced is the 20th equation presented in that work. The KR20 formula is:

$$KR20 = \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} p_i(1-p_i)}{s_T^2}\right] \quad \text{(Equation 3)}$$

where k is the number of items, p_i is the proportion of examinees with a score of 1 on the i-th item, and s_T² is the variance of the scores on the total test.

After the introduction of the KR20 statistic, Hoyt (1941) demonstrated that the same general calculation could be produced through a repeated-measures analysis of variance approach to the subject-by-item data matrix. Several years later, Cronbach (1951) extended the work to include tests with more than two choices for each item.
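As an illustration of Equation 3, the sketch below applies the formula to a small matrix of dichotomously scored responses; the data are fabricated solely to show the computation.

    import numpy as np

    def kr20(scores: np.ndarray) -> float:
        """Kuder-Richardson #20 (Equation 3) for a persons-by-items 0/1 matrix."""
        k = scores.shape[1]
        p = scores.mean(axis=0)                      # proportion scoring 1 on each item
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
        return (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / total_var)

    # Five examinees by four dichotomous items (fabricated for illustration).
    responses = np.array([
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ])
    print(round(kr20(responses), 3))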
Both KR20 and alpha are linked to the parallel test approach through the split-half method, in which a single test is divided into parts and scores on one part are correlated with scores on the other (Murphy & Davidshofer, 2001). Specifically, alpha represents the mean correlation that would result from every possible combination of split tests. The general formula for Cronbach's coefficient alpha is:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_T^2}\right) \quad \text{(Equation 4)}$$

where k is the number of items in the test, σ_i² is the variance of the i-th item, and σ_T² is the total test variance.

There are a variety of commonly accepted interpretations of what alpha measures. For example, it has been stated (Cortina, 1993) that alpha is: 1) the mean of all split-half reliabilities (Cronbach, 1951); 2) the lower bound of the reliability of a test (Kristof, 1974; Novick & Lewis, 1967); 3) a measure of first-factor saturation, or unidimensionality (Crano & Brewer, 1973; Hattie, 1985); 4) equal to reliability under conditions of essential tau-equivalence (τ-equivalence); and 5) a more general version of the KR20 coefficient (Cronbach, 1951; Fiske, 1966; Hakstian & Whalen, 1976). Crocker and Algina (1986) offer the following interpretation (p. 120):

"When a composite test is made up of nonparallel subtests, we can estimate the lower bound of its coefficient of precision by using coefficient alpha. This computation requires that we know the number of subtests, the variance of the composite scores, and the sum of all the subtest covariances. The usefulness of this relationship will be more apparent if we recall that any test may be regarded as a composite and each item as a subtest. Thus, coefficient alpha provides a convenient way to estimate the lower bound of the coefficient of precision for a test by using item response data obtained from a single administration of that test."
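For computational reference, a minimal implementation of Equation 4 follows; the Likert-type responses are again fabricated purely for illustration.

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """Coefficient alpha (Equation 4) for a persons-by-items score matrix."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1).sum()   # sum of the item variances
        total_var = scores.sum(axis=1).var(ddof=1)     # variance of the total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    # Rows = examinees, columns = items (fabricated ratings).
    ratings = np.array([
        [4, 5, 4, 4],
        [2, 3, 3, 2],
        [5, 5, 4, 5],
        [3, 3, 2, 3],
        [1, 2, 2, 1],
    ])
    print(round(cronbach_alpha(ratings), 3))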
Assumptions of Coefficient Alpha

In order for alpha to be interpreted as an accurate reflection of reliability, two assumptions must hold true. First, while the items need not be perfectly parallel, they must be essentially τ-equivalent [1]. It has long been known that the assumption of τ-equivalence is routinely violated, and it has been mathematically demonstrated that when this occurs alpha will produce a lower bound estimate of reliability (Novick & Lewis, 1967). However, it has been noted that a basis for the calculations leading to this conclusion is that the second assumption must hold true (Zimmerman, Zumbo, & Lalonde, 1993): error associated with individual items must not be correlated with the errors of other items.

[1] Items are defined to be essentially τ-equivalent when their true score counterparts differ only by an additive constant.

Zimmerman et al. (1993) investigated the effects of the two violations separately and found that violation of the τ-equivalence assumption produces a deflated reliability coefficient, while violation of uncorrelated errors produces an inflated estimate. However, the result of each violation in isolation does not indicate how simultaneous violations of both assumptions will affect the reliability estimate. The possibility existed that one violation trumped the other, or that violating both at the same time creates a wash effect in which neither exerts enough influence to inflate or deflate the estimate.

It was not until Komaroff (1997) investigated the effects of simultaneous violations of essential τ-equivalence and uncorrelated error on coefficient alpha that a solid understanding of the impact was made known and the alpha coefficient could be fully interpreted. To determine the interactive effect of simultaneous violations of both essential τ-equivalence and uncorrelated error, Komaroff (1997) varied the true score correlations among items on a hypothetical assessment, the error score correlations, and the number of items with correlated error. Results demonstrated that correlated error attenuates the degree to which alpha underestimates ρ_xx' [2] under violations of essential τ-equivalence, and when this effect is most pronounced alpha can overestimate ρ_xx'. Komaroff (1997) thus demonstrated that under violations of these dual assumptions alpha remains likely to be an overestimate of reliability's lower boundary.

[2] ρ_xx' = σ²(T)/σ²(X), where σ²(T) is the true composite variance and σ²(X) is the observed composite variance.

To demonstrate, take the basic classical test theory linear model for two individual items:

$$X_1 = T_1 + E_1; \quad X_2 = T_2 + E_2 \quad \text{(Equation 5)}$$

where X = observed score, T = true score, and E = error score. The covariance between the two items is:

$$\mathrm{COV}(X_1,X_2) = \sigma_{X_1}\sigma_{X_2}r_{X_1X_2} = \mathrm{COV}(T_1,T_2) + \mathrm{COV}(E_1,E_2) + \mathrm{COV}(T_1,E_2) + \mathrm{COV}(T_2,E_1) \quad \text{(Equation 6)}$$

COV(T_i, E_j) = 0 because true score variance cannot be associated with error variance by definition, so the last two terms drop out. COV(E_1, E_2) = 0 is an assumption and can be violated by data (Komaroff, 1997b). If COV(E_1, E_2) > 0, the sum of the observed (X) item covariances will be inflated and coefficient alpha will be an overestimate of reliability. Returning to the formula for coefficient alpha (Equation 4), recall that σ_T² is the sum of all the test items' variances and covariances. If the sum of the covariances is inflated, σ_T² will be inflated, which in turn will decrease the value of Σσ_i²/σ_T², increase the value of 1 − Σσ_i²/σ_T², and, to wit, cause alpha, (k/(k−1))(1 − Σσ_i²/σ_T²), to be an inflated estimate of internal consistency.

Like most assumptions used as a basis for operation, the accuracy of COV(E_1, E_2) = 0 has not been greatly scrutinized. A potential reason for this oversight is evident in returning to the Crocker and Algina quotation (above). If alpha is indeed used as a calculation of the consistency among test scores (i.e., alpha reflects reliability among separate tests, not items [3]), it is reasonable to assume that whatever irrelevant factors contributed to performance on one test are not likely to be related to irrelevant factors that influenced performance on other tests. For example, if construction work is being performed outside a classroom on a warm day and the air conditioning is out of order, forcing the teacher to leave the window open throughout the testing period, the noise may negatively affect an individual's performance on that test, but the situation is not likely to reappear during other administrations. In this instance the error variance associated with test conditions would be relatively uncorrelated across administrations. In contrast, if alpha is used as a calculation of the consistency among items on that individual test, the noise could affect the student's answers to any or all of the test questions, as it would be present from one item to the next. In this instance, hypothetically different "administrations" would have correlated sources of error variance.

[3] Alpha can represent the internal consistency of data at the test or item level.
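The inflation described above is easy to reproduce by simulation. The following sketch, a rough illustration with arbitrary variance values rather than a reproduction of Komaroff's design, generates τ-equivalent items whose errors share a common transient component and compares alpha with the true reliability, σ²(T)/σ²(X).

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 50_000, 10

    # A common true score for all items (essential tau-equivalence holds here).
    true = rng.normal(0, 1, size=(n, 1))
    # Item errors: an item-specific part plus a shared "transient" part that
    # correlates the errors across items within the single administration.
    transient = rng.normal(0, 0.6, size=(n, 1))
    specific = rng.normal(0, 0.8, size=(n, k))
    items = true + transient + specific

    def cronbach_alpha(x: np.ndarray) -> float:
        kk = x.shape[1]
        return (kk / (kk - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                                  / x.sum(axis=1).var(ddof=1))

    composite = items.sum(axis=1)
    true_reliability = (true * k).ravel().var() / composite.var()
    print(round(cronbach_alpha(items), 3), round(true_reliability, 3))
    # Alpha (about .95) exceeds the true reliability (about .70)
    # because COV(E_i, E_j) > 0 across items.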
This may be a poor example to present to testing specialists, who meticulously control locations to eliminate such "noise." Yet even experts can only control environmental factors. The influence of internal factors, such as an individual's psychological state, cannot be manipulated as easily as the window. These transient errors, as they have been termed (Becker, 2000), which consist of fluctuations in test-takers' moods/affect/states, while random from one test occasion to another, might saturate many (if not all) of the items an individual answers throughout a single test administration.

Transient Error

Transient errors are response variations that are due to random changes in test-takers' psychological states across time. As the changes are not related to the construct(s) being assessed by the measure, the variability produced by these random fluctuations should be considered error variance. While the influence of these psychological states is temporary and random from one time to another, it is quite likely the test-taker will remain in the same state while answering several questions, if not the entire test. Recall that an underlying assumption for coefficient alpha is that errors among measurement units are uncorrelated, or, as Becker (2000) notes, "that its violation entails distortion of inconsequential magnitude" (p. 373). If test-takers' psychological state affects their performance from item to item, the errors will be correlated. Becker (2000) remarked, "there have appeared several articles reintroducing alpha, with special attention drawn to its proper use and concerns for its limitation, assumption violations, and misinterpretations (Cortina, 1993; Miller, 1995; Schmitt, 1996)... yet, the violation of the assumption of uncorrelated errors is mentioned in only one of these three articles, and there it is dismissed as likely being of little import" (p. 373).

Following these review articles, a number of investigators began to reexamine the assumptions underlying alpha. The few that looked at the effects of correlated error demonstrated that it can significantly inflate estimates of reliability (e.g., Komaroff, 1997; Raykov, 1998). At first, research in this area was mostly conducted by psychometricians and statisticians interested in mathematical proofs of their theories. The first application of the research revolved around means to partial out the error variance when correcting observed correlations, a more applied but still surely academic line of research. It was not until Becker's work that the issue of transient error, and its effect on the test score reliability associated with commonly applied measures, was systematically examined through an empirical study.

Becker collected self-report ratings of approximately 400 undergraduate university students on three inventories: the Buss-Perry Aggression Questionnaire (BPAQ; Buss & Perry, 1992), which contains four scales (Anger, Hostility, Physical Aggression, and Verbal Aggression); the Rosenberg Self-Esteem (RSE) Scale (Rosenberg, 1965); and the Gender-Free Inventory of Desirable Responding (GFIDR; Becker & Cherny, 1994). Following what he termed a staggered equivalent split-half procedure, the error variance associated with transient error was able to be partialed out.
Relative to true-score variance, the magnitude of transient error was .067 for Physical Aggression, .003 for Verbal Aggression, .021 for Anger, and .145 for Hostility. Transient error was associated with 5.2% of total variance (relative to the estimated true score variance) for the RSE and 11.7% for the GFIDR. Becker concluded that, depending on which assumption is more greatly violated, essential τ-equivalence or uncorrelated error, alpha can be a lower or an upper boundary of reliability. While Thorndike (1951) had called for investigations into the influence of transient error (though he did not use the term) half a century before, Becker's (2000) was the first substantive study to demonstrate its influence on the results of applied psychological assessments.

Since Becker's work, research on the topic has continued to appear (e.g., Reeve, Heggestad, & George, 2005; Vautier & Jmel, 2003), offering advanced understanding of the ways transient error influences test score interpretation and the best way(s) to assess transient error. For example, while Becker implemented a staggered split-half design, Schmidt, Le, and Ilies (2003) used an approach based upon the calculation of the Coefficient of Equivalence and Stability (CES). While a measure of internal consistency (a coefficient of equivalence), such as alpha, can capture variance associated with random response error and specific factor error, and test-retest measures (measures of stability) can account for random response error and transient error, only the CES can assess error variance from all three sources. Calculating the CES, along with one of the other types of measures, allows the influence of independent error sources to be determined. For example, the difference between the CES and a measure of stability should be due to specific factor error. Alternatively, the difference between the CES and a measure of internal consistency will stem from the influence of transient error. It was in this manner that Schmidt et al. (2003) conducted their investigation.
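A small sketch of that decomposition logic follows; the coefficient values are invented placeholders rather than figures from Schmidt et al. (2003).

    # Isolating error sources from three reliability coefficients, following the
    # logic described above. All input values are hypothetical placeholders.
    coefficient_of_equivalence = 0.85  # e.g., alpha from a single administration
    coefficient_of_stability = 0.82    # e.g., a test-retest correlation
    ces = 0.78                         # coefficient of equivalence and stability

    # The CES treats random response, specific factor, and transient error all
    # as error, so its gaps from the other coefficients isolate single sources.
    transient_error = coefficient_of_equivalence - ces      # missed by alpha
    specific_factor_error = coefficient_of_stability - ces  # missed by test-retest

    print(f"transient error proportion: {transient_error:.2f}")
    print(f"specific factor error proportion: {specific_factor_error:.2f}")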
Schmidt et al.'s (2003) study replicated and expanded upon Becker's findings. Transient error was found to be present in a variety of commonly used (particularly for personnel selection) measures of individual differences, but the extent to which transient error appeared varied considerably. These measures included the Wonderlic Personnel Test, a test of general mental ability; two separate measures of the Big 5 personality traits (i.e., Conscientiousness, Extraversion, Agreeableness, Neuroticism, and Openness to Experience), namely the Personal Characteristics Inventory (PCI; Barrick & Mount, 1995) and the International Personality Item Pool (IPIP; Goldberg, 1997); Sherer, Maddux, Mercandante, Prentice-Dunn, Jacobs, and Rogers' (1982) Generalized Self-Efficacy Scale (GSE); two measures of self-esteem; and three measures each of positive and negative affectivity. While it was hypothesized that transient error would be smallest in the cognitive domain and largest in areas concerned with affective states (with personality traits falling in the middle, depending on whether they were more cognitively or affectively loaded), the results were not as straightforward. Transient error associated with the Wonderlic, the cognitive measure, was much less than that associated with positive and negative affectivity (6.7% versus 17.8% and 14.5%, on average, respectively). The amount was roughly equal to that calculated from the measure of self-efficacy (6.3%) and more than that of several personality factors (on average: Extraversion was 2.2%, Agreeableness was 3.6%, and Openness to Experience was 0.0%). Schmidt et al. (2003) noted: "the primary implication of these findings is that the nearly universal use of the CE (coefficient of equivalence) as the reliability estimate for measures of important and widely used psychological constructs, such as those studied in this research, leads to overestimates of scale reliability" (p. 218).

As can be seen, transient error can only be calculated when a test is completed on two different occasions. Applied practitioners, such as selection analysts, may be left to wonder how these developments should influence their practice, since they rarely deal with data from repeated administrations. Becker called for more research investigating the effects of transient error on various measures in order to determine implications for research and practice. He noted that measures identified as less susceptible to transient errors should be used in place of those that are more liable to be affected by its influence. This is an important point, as the alpha point estimate is typically reported without any indication of the precision of the statistic; the results of two measures that produce similar alpha levels may appear to be equally well suited, but the effects of transient error may be much greater for one. Unfortunately, the influence of transient error has only been investigated using a very small number of measures, as the sparse literature review above suggests.

In the interim, one strategy that can be used to account for the overestimation of alpha is to include a confidence interval around the point estimate. Empirical and conceptual research has demonstrated that the alpha point estimate may lack precision and accuracy in a number of commonly encountered testing situations (Charter & Feldt, 2002), and calls for the presentation of confidence intervals along with point estimates have been made to communicate shortcomings of the statistic. Yet while these calls have been presented in top journals of applied psychological research (e.g., Duhachek & Iacobucci, 2004) and educational measurement (e.g., Fan & Thompson, 2001), the impact of these works is not evident. Researchers and practitioners alike may be reluctant to proffer information beyond a reliability coefficient for a variety of reasons. Test developers may fear that the presentation of a lower boundary will negatively affect perceptions of their test, while those who use test results as decision-making tools may fear such information will undermine their conclusions. More simply, the rarity of presentation could be due to ignorance regarding confidence intervals' calculation and/or their importance.

Confidence intervals are visible reminders of the fact that a point estimate is not a perfect indicator of scores' true reliability. While this is important for professionals to keep in mind when they present results, it is crucial for the less statistically educated decision makers who use test information to make judgments (as they are more likely to accept the alpha point estimate at face value). Accepting the fact that not only are tests imperfect measures of whatever is being assessed, but the statistics used to describe those tests are imperfect indicators as well, is critical to the advancement of social science research and the legitimacy of its application.
The considerable body of work surrounding corrections to effect sizes is a good example of such advancement. It has long been known (e.g., Johnson, 1944) that one of the effects of measurement error is the attenuation of obtained effects, keeping them from reaching the size that would exist if true values, free from error, were obtained. As Baugh (2002) notes, "interpretation of effects without correcting for score unreliability is equivalent to assuming the scores are perfectly reliable even if evidence to the contrary is recognized" (i.e., the reliability coefficient is not 1.00; p. 256). Research has highlighted the common sources of noise that often produce lower effects, and methods exist to correct for score unreliability (e.g., Hunter & Schmidt, 1994; Hunter, Schmidt, & Jackson, 1982). Such procedures yield an estimate of the effect size that one might expect to find in a perfect study (Rosenthal, 1994) and thus a true indication of the relation between variables.

Correcting effect sizes for unreliability of scores has obvious benefits. As more accurate relations among measures and outcomes are revealed, the utility of those measures will be better understood and, too, their application. Yet, when comparing the score of one individual to another, such as for selection purposes, reliability information can do little more than provide a general degree of confidence in decisions. Since selection decisions usually stem from a single testing period (though a composite of tests may be used), it is unknown whether individuals' scores on a measure are higher than, lower than, or spot on in comparison to their true scores. Thus no corrections can be made to individual scores. So while the reliability of a measure can be used to adjust obtained effects to better understand the true relations among variables, no formulae exist to adjust individuals' scores to better understand the true comparability of their attributes. The standard error of measurement (SEM) statistic does provide test users with some insight into the accuracy of an individual's test score, but it is not used to adjust obtained scores. While rank-order cannot be changed on its basis, the standard error of measurement (more specifically, the standard error of the difference, a statistic based upon the SEM) can be applied through a technique designed to create ranges of indifference among scores that compensate for the shortcomings of a test's reliability.

Test Score Banding

While test score banding is very commonly applied (for instance, converting percentage correct on a test to a letter grade), a particular form of banding has created a great amount of controversy since its introduction. The technique is known as SED banding, where SED stands for standard error of the difference (a statistic related to the reliability of a measure). SED banding (simply referred to as banding hereafter) is based upon the notion that small differences among scores on a test might not be meaningful because the discrepancy could be due to measurement error, as no selection device is perfectly reliable. When scores from an imperfect test are used to make selection decisions among a group of applicants, confidence in those decisions is only as great as the reliability of the measure. Cascio, Outtz, Zedeck, and Goldstein (1991) proposed a technique for creating a band around the highest score on a test so that decision makers can be confident that the scores outside the band are significantly different from the top score.
Conversely, all the scores within the band may not be significantly different from the top score in the band, and so are all considered equal. The width of the band is calculated as:

$$\text{Bandwidth} = 1.96 \cdot \sqrt{2}\,\sigma_x\sqrt{1-\alpha} \quad \text{(Equation 7)}$$

where σ_x√(1−α) is more commonly known as the standard error of measurement (SEM) and √2·σ_x√(1−α) is referred to as the standard error of the difference (SED). "The rationale for specifying the bandwidth as 1.96 X SED is borrowed from the classical hypothesis-testing convention that a null hypothesis should not be rejected if the observed data or more extreme data have at least a .05 probability [4] of occurring if the null hypothesis were true" (when scores are normally distributed; Kehoe & Tenopyr, 1994, p. 297).

[4] Following a normal distribution of scores, 95% will fall between +/- 1.96 standard deviations.

SED bands gained popularity because the range of indifference they establish was proposed as a way to create greater opportunity for selecting protected group members. Since many of the most popular types of selection assessments have been demonstrated to produce score differences between Caucasians and protected classes (particularly African Americans; see Table 1), organizations that value diversity often face incompatible choices when it comes to selection measures: they can either choose the test that will yield the greatest return on their investment (e.g., a cognitive ability test; Schmidt & Hunter, 1998) at the expense of diversity goals, or they can advance their diversity goals at the expense of their selection device's utility. In an article addressing this issue, Cascio et al. (1991) introduced several approaches to test score use that could assist organizations, including several forms of test score banding [5]. While Cascio et al. demonstrated that any approach other than strict top-down selection will result in the loss of an assessment's utility, top-down selection can often result in adverse impact against protected classes. If one of the organization's goals is to increase diversity, test score banding could be a viable alternative.

[5] Several forms of test score banding exist (e.g., fixed vs. sliding); distinctions among them are inconsequential for the purposes of the present research, as any form may be substituted for another.

The process and purpose of the technique are best understood through an example. Imagine there were 50 applicants being selected on the basis of the General Aptitude Test Battery (GATB), a cognitive ability exam. The selection ratio is 10% (i.e., five applicants), test scores ranged from 65 to 95, and the top African American scores were 86, 87, and 89, while five majority applicants scored higher (see Table 2). If selection were conducted top-down, from the highest to the lowest score, African Americans would have no chance of being selected before the selection ratio was met. If a band were created based on an SED equal to 5.00, the band would roughly range from 85 to 95. Scores above 85 would not be considered significantly different from the top score. Therefore, the top three African American scorers could possibly be selected (depending on the criteria for selection within the band).
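The sketch below computes an Equation 7 band for the hypothetical GATB scenario above. The standard deviation and reliability values are invented so that the SED comes out near 5.00, mirroring the example; substituting the lower confidence boundary of alpha for the reliability yields the wider band of Equation 8, discussed below.

    import math

    def sed_band_floor(top_score: float, sd: float, reliability: float) -> float:
        """Lower edge of a 1.96 x SED band below the top score (Equation 7)."""
        sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
        sed = math.sqrt(2) * sem                # standard error of the difference
        return top_score - 1.96 * sed

    # Hypothetical values chosen so the SED is roughly 5.00, as in the text.
    floor = sed_band_floor(top_score=95, sd=11.2, reliability=0.90)
    print(round(floor, 1))  # about 85: scores of 86, 87, and 89 fall inside the band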
Proponents of this approach cite such examples as evidence that SED banding can help reconcile, through justified scientific means, the conflict between an organization choosing a test high in utility and advancing a policy of diversity (Zedeck, Outtz, Cascio, & Goldstein, 1991; Cascio, Goldstein, & Outtz, 1995). Critics of SED banding contend the practice is neither scientific nor justified and have questioned the logic and utility of the technique (Schmidt, 1991; Schmidt & Hunter, 1995). While theoretical arguments regarding the rationale driving the technique are beyond the scope of the present work, it should be noted that although opponents of banding have made many points that are sound and compelling, the technique has been embraced by practitioners and is not likely to be abandoned any time soon. As such, a practical research course is to determine the most appropriate width at which to set the band, and thus apply the technique. Guion (2004) provided the following guidance on the topic: "the reasoning guiding the judgment on the width of the range of indifference should be articulated well enough to be made a matter of written record. This is partly in recognition that business is done in an age of litigation, and one needs to have clear reasoning behind decisions affecting employment processes" (p. 56).

Using α_LB to Set the Bandwidth

It has been proposed that a confidence interval be presented along with the alpha point estimate in all practical cases. Decision makers will be left to determine how to judge this information, but, hopefully, it will be clear that the rank-order of candidates' scores is neither a perfect indication of their knowledge, skills, and/or abilities, nor is it likely to be a perfectly ordinal presentation of the amount candidates will contribute to the organization. While test scores cannot be corrected, ranges of indifference can be created and other factors can be introduced to the selection process. While it is not necessarily advocated here, it is proposed that, based upon the fundamental logic of the SED banding approach, it would be reasonable to substitute the lower boundary of a confidence interval in place of the point estimate when creating a band. This approach would provide test users with the greatest degree of assurance that differences between the band referent (i.e., the top score) and those scores lying outside of the band are truly significantly different. This approach would also create the largest range of indifference and maximize selection opportunities for members of lower scoring groups.
Using alpha?s lower boundary, based upon the calculation of a confidence interval, provides assurance 6 that this does not occur. Thus, Cascio et al.?s calculation would be amedended as such: Bandwidth = (1.96) 2 [ x ? ( 1 - LB ? ) ] (Equation 8) Where ? = the standard deviation of the test scores and LB ? = the lower boundary of alpha?s confidence interval Confidence in Test Results It is in the interest of personnel practitioners, as well as their clients, to be honest about properties of the tests they create and use. In light of recent research on transient error, it appears coefficient alpha, the favored reliability point estimate most practitioners report, is likely to overestimate the lower boundary of the result?s reliability. The presentation of a confidence interval will serve as a reminder that while everything can be done to ensure a test is valid and constructed as well as it possibly can be, there are factors beyond control that will affect the results of any test, though no one can know the full extent of those factors. 21 6 Using .05 as the chance level for the reliability estimate and the band. A unique data set, consisting of test-retest results from an applied setting, provides the opportunity to pull all of these lines of research together. First, the influence of transient error will be demonstrated via a recently proposed method for calculating a reliability estimate from test-retest data. While Becker (2000) and Schmidt, et al. (2003) used derivations of an alternate forms method to calculate transient error, such a design is impractical for researchers interested in applied settings, as this would entail every applicant completing two test administrations. Green (2003) formulated a model that allows transient error to be identified through the analysis of data from the repeated administration of a whole test. A reliability estimate based upon a true-score model with transient error was presented and proposed as a reformulation of coefficient alpha, named test-retest alpha. The coefficient is calculated as: 21 ? XX ? = 21 21 ?? ? ' 2 xx xjxj k ?? ? (Equation 9) Where k is the number of items on the test, 1 ? x ? is the standard deviation of the scale from the first administration, 2 ? x ? is the standard deviation of the scale from the second administration, and 21 ' ? xjxj ? is the average of pooled different time/different item covariances. Figure 1 is adopted from Green?s (2003) work and illustrates how different sources of variances are captured within and between test administrations. ?The test- retest alpha estimates true-score variance based on the different-time/different-item covariances, whereas coefficient alpha estimates true-score variance based on the same- time/different-item covariance? (Green, 2003; p. 89). The presence of transient error can 22 be empirically demonstrated when test-retest alpha is less than alpha. 7 Green also discusses the difference between test-retest alpha and the test-retest correlation. He notes: ?With test-retest correlations, estimates of true-score variance are affected not only by different-time/different-item covariances but also by different-time/same-time covariances. These latter covariances are likely to create an inflated estimate of reliability to the extent that respondents remember how they responded at Time 1 and respond similarly at Time 2? (Green, 2003; p. 89). Green?s research was purely mathematical; no empirical data were used to demonstrate his calculations. 
While Green's work will aid applied practitioners who want an accurate reliability estimate in the unusual case where they have test-retest data, like Becker's (2000) and Schmidt et al.'s (2003) calculations it cannot identify transient error from a single test administration. Therefore, the present study echoes calls, such as that of Fan and Thompson (2001), for the presentation of confidence intervals along with point estimates of reliability. Duhachek and Iacobucci (2004) recently presented a statistic based on the distribution and standard error of coefficient alpha and demonstrated the superiority of the formula, and of the confidence intervals created from it, in comparison to past calculations (e.g., Feldt & Ankenmann, 1999; Barchard & Hakstian, 1997). Based on the work of van Zyl, Neudecker, and Nel (2000), which presented the asymptotic distribution of the maximum likelihood estimator of coefficient alpha, the authors formulated an estimate of alpha's standard error (ASE). The authors argue that, based upon the ASE, "for the first time, applied and theoretical researchers are able to estimate the standard errors of their measures, thereby revealing precisely the magnitude and severity of the problem of measurement error with less restrictive assumptions on the data" (than past estimates) (p. 792). For the ASE, the distribution of alpha is derived asymptotically as n → ∞, with √n(α̂ − α) following a normal distribution with a mean of zero and a variance of:

$$Q = \frac{2k^2}{(k-1)^2(j'\hat{V}j)^3}\left[(j'\hat{V}j)\left(\mathrm{tr}\,\hat{V}^2 + (\mathrm{tr}\,\hat{V})^2\right) - 2(\mathrm{tr}\,\hat{V})(j'\hat{V}^2j)\right] \quad \text{(Equation 10)}$$

where n represents the sample size, α̂ is the MLE of α, j is a k x 1 vector of ones, and V is the population covariance matrix among the items (van Zyl et al., 2000). With this variance in hand, the ASE was derived in the article to equal:

$$ASE = \sqrt{\hat{Q}/n} \quad \text{(Equation 11)}$$

and the appropriate confidence interval (approximately 95%), based on CTT hypothesis testing, is:

$$\hat{\alpha} \pm 1.96\sqrt{\hat{Q}/n} \quad \text{(Equation 12)}$$
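The sketch below is a direct transcription of Equations 10 through 12; treat it as an illustration of the published formulas rather than validated production code, with fabricated item responses used purely for the demonstration.

    import numpy as np

    def alpha_confidence_interval(scores: np.ndarray, z: float = 1.96):
        """ASE-based confidence interval for coefficient alpha (Equations 10-12)."""
        n, k = scores.shape
        V = np.cov(scores, rowvar=False, ddof=1)   # item covariance matrix
        j = np.ones(k)
        jVj = j @ V @ j                            # total scale variance
        alpha = (k / (k - 1)) * (1 - np.trace(V) / jVj)
        Q = (2 * k**2 / ((k - 1)**2 * jVj**3)) * (
            jVj * (np.trace(V @ V) + np.trace(V)**2)
            - 2 * np.trace(V) * (j @ V @ V @ j)
        )
        ase = np.sqrt(Q / n)                       # Equation 11
        return alpha, alpha - z * ase, alpha + z * ase   # Equation 12

    # Fabricated responses: 200 examinees by 8 items.
    rng = np.random.default_rng(3)
    demo = rng.normal(0, 1, size=(200, 1)) + rng.normal(0, 1, size=(200, 8))
    a, lb, ub = alpha_confidence_interval(demo)
    print(f"alpha = {a:.3f}, 95% CI = [{lb:.3f}, {ub:.3f}]")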
Using α_LB in Personnel Selection

While alpha is termed a "point" estimate, the research above (e.g., Becker, 2000; Komaroff, 1997; Schmidt et al., 2003) demonstrates it is actually a rather blunt instrument for approximating reliability. Even when other assumptions are met, violation of essential τ-equivalence deflates alpha as an estimate of reliability, while violation of the uncorrelated errors assumption inflates the statistic. In most applied settings, neither of these assumptions is likely to hold, so the obtained alpha coefficient will not present accurate information about the test results' stability. Presenting alpha with a confidence interval surrounding the point estimate will help better communicate the lack of precision in the measure being used. A likely practical effect of this presentation will be a decrease in the confidence that decision makers who use results of the measure to differentiate among test-takers have in their decisions.

One of the most likely courses of action personnel practitioners may embrace in such a situation is to create ranges of indifference, based upon the reliability of the test, so decision makers can increase their confidence that true differences do exist between the top scorers and those outside the range of indifference. The final areas of investigation demonstrate how the confidence interval boundaries surrounding coefficient alpha can be incorporated within a variety of equations to produce more cautious and prudent results. The examples chosen are only a sample of the calculations commonly utilized in the practice and research of personnel selection that employ the alpha coefficient. In many cases, using the lower boundary of the reliability estimate can provide practitioners concerned with the inadequacies of their tests, and of the statistics used to assess them, a cautious base from which to develop further calculations.

As mentioned, one application is to create wider ranges of indifference by substituting α_LB into the SED banding equation (Equation 8). The rationale behind this approach is that only when the lower boundary of alpha's confidence interval is used in the creation of the bands can one assert with great confidence (i.e., at the 95% level) that, based upon the reliability of the measure, the scores lying outside of the top band are truly significantly different from the scores within it. The results of this modification are likely to be greater opportunity for minority selections and reduced adverse impact from selection measures.

A second example could be informative for practitioners involved in the test development stage of their selection system. Replacing the traditional point estimate with α_LB in the Spearman-Brown prophecy formula (SBPF) will provide test developers a more conservative estimate when determining the effect of increasing the number of items in their test. Namely,

$$\alpha_{new} = \frac{i\,\alpha_{LB}}{1 + (i-1)\,\alpha_{LB}} \quad \text{(Equation 13)}$$

where i = the factor increase in items (e.g., i = 2 would be twice the current number of items).

Take an example where pilot test data show the internal consistency of a measure to be .67 using α as it is traditionally calculated, and the lower boundary of the confidence interval is .57. By doubling the number of items, the test's internal consistency would reach .80 using α, but would only reach .73 using α_LB (see Table 3). While there is no uniformly accepted standard of what constitutes high (versus low) levels of reliability, .80 is often used as a threshold. However, the fact that the SBPF using traditional alpha overestimates the SBPF using α_LB by 10% should cause some concern. Looking at the example another way, if the test comprised 10 items and the developer wished to reach .80 as the level of reliability, using traditional alpha in the SBPF would suggest the revised test needs to contain 20 items, while substituting α_LB would suggest 30 items are necessary to reach the .80 level.

Table 3
Outcome of Spearman-Brown Prophecy Formula Using α and α_LB

SBPF using traditional α: α_new = iα / [1 + (i−1)α] = (2)(.67) / [1 + (2−1)(.67)] = .80
SBPF using α_LB: α_new = iα_LB / [1 + (i−1)α_LB] = (2)(.57) / [1 + (2−1)(.57)] = .73
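The sketch below reproduces Table 3 and the item-count comparison; the .67 and .57 inputs come from the hypothetical example above.

    def sbpf(reliability: float, i: float) -> float:
        """Spearman-Brown prophecy formula (Equation 13): projected reliability
        after lengthening a test by a factor of i."""
        return (i * reliability) / (1 + (i - 1) * reliability)

    def sbpf_factor(current: float, target: float) -> float:
        """The SBPF solved for i: the lengthening factor needed to reach `target`."""
        return (target * (1 - current)) / (current * (1 - target))

    print(round(sbpf(0.67, 2), 2), round(sbpf(0.57, 2), 2))   # 0.8 versus 0.73
    print(round(10 * sbpf_factor(0.67, 0.80)), round(10 * sbpf_factor(0.57, 0.80)))
    # From a 10-item pilot: about 20 items using alpha, about 30 using alpha_LB.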
The general equation is

r^*_{12} = \frac{r_{12}}{\sqrt{r_{11} r_{22}}} (Equation 14)

where r*_12 = the disattenuated correlation, r_12 = the observed (attenuated) correlation between variables 1 and 2, and r_11 and r_22 are the reliability estimates for those variables. Unlike the previous examples, in this case using α_LB will create a more liberal estimate of the correlation between the true scores of the constructs. Take the following example (Table 4), where a predictor measure and a criterion measure have an observed correlation of .70, an alpha point estimate has been found to be .80 for the predictor measure and .80 for the criterion, and a confidence interval around r_22 has been calculated as +/− .08 (no CI was calculated for the predictor). As can be seen, this correction can make a notable difference, which of course would be even greater if both the predictor and criterion were corrected for unreliability.

Table 4
Outcome of Correction for Attenuation Formula Using α and α_LB

Correction for attenuation using α_LB:          r*_12 = .70 / √(.80(.72)) = .92
Correction for attenuation using traditional α: r*_12 = .70 / √(.80(.80)) = .875

Using α_LB in place of the traditional estimate can provide insight into how strong the correlation might be given a "worst case scenario" regarding the reliability of the measure. This could be an effective tool for those revising a measure to be used as a criterion. Comparing newly obtained correlations, using revised editions of the measure, with these estimates can help inform progress.
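A small Python sketch of Equation 14 with the Table 4 figures (the function name is the writer's own):

import math

def disattenuate(r12, r11, r22):
    # Correct an observed correlation for unreliability in both measures.
    return r12 / math.sqrt(r11 * r22)

print(disattenuate(0.70, 0.80, 0.80))   # traditional alpha: .875
print(disattenuate(0.70, 0.80, 0.72))   # lower boundary:   ~.92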
The present study will use these examples to demonstrate the substantive impact that replacing the alpha point estimate with α_LB can have on personnel selection practices. The goal of the present research is to draw attention to the shortcomings of the alpha point estimate, present practical applications of recent theoretical advancements in reliability research, and demonstrate several ways more conservative estimates of reliability can influence the practice of personnel selection.

METHODOLOGY

Sample

The items that make up the composite reviewed in the present study were part of a larger technical knowledge test on which all candidates who met minimum qualifications for promotion to the rank of sergeant were assessed in both 1999 and 2001. Objections to the results of the 1999 administration were raised in a Federal District Court on the grounds of disparate impact against African Americans. Plaintiffs successfully lobbied for the nullification of the 1999 results and the creation of a new hiring list following the re-administration of the selection procedure, with modifications approved by a Special Master. A Court Order was issued to include various activities leading to revisions of the 1999 selection system and a re-administration of the examinations. One hundred seventy-one police officers (60% Caucasian, 40% African American; 86% male) completed the same portion of the closed-book examination (with minor "cosmetic" changes; see Appendix A for examples) at both the 1999 and 2001 administrations. These candidates constitute the sample used in the present study.

Measure

In 1999 an external consulting firm was awarded a contract to create and administer several promotional examinations for police positions within a large, municipal Merit System in the southeastern United States. As a basis for test development the consultants conducted a comprehensive job analysis for each rank for which selections would be made. The consultants' job analysis methodology began with site observations, wherein consultants familiarized themselves with the regular duties performed by incumbents of each rank. Small group interviews followed, in which incumbent subject matter experts (SMEs) generated lists of tasks performed within each rank. After a comprehensive list of tasks was created, panels consisting of larger groups of SMEs were assembled to review the lists and provide additional information that might have been overlooked. A survey composed of all tasks identified by the panels was then created and disseminated to a large group of incumbents, who provided individual ratings of the importance and frequency of each task's performance. Based upon the results of this survey, tasks deemed critical were identified and concentrated upon in further test development processes.

In a manner similar to the task analysis, the areas of knowledge, skills, and abilities (KSAs) that must be possessed in order to effectively perform the tasks established as critical to the sergeant position were identified. First, a sample of incumbents was guided by the consultants to list every area of knowledge, skill, or ability they used on the job. This list was then transformed into a questionnaire and presented to all incumbents who held the rank. Each sergeant rated each KSA on its importance, the frequency of its application, and whether or not a newly appointed sergeant should possess it before assuming the position. The KSAs that met the threshold for testing were grouped into one of eight categories, including Technical Knowledge, Written Expression, Interpersonal Relations, Information Analysis, Judgment and Decision Making, Planning and Organizing, and Resource Management. These categories were tested via three selection instruments: a Technical Knowledge Written Test, an In-Basket/Work Sample Test, and an Oral Board Test (see Table 5). As the greatest focus of the current research is placed on the Technical Knowledge component of the selection procedure, only the development of this exam will be discussed further.

The technical knowledge test consisted of two components, an open-book and a closed-book portion. The extent to which each area of knowledge was applied through reference or through recall determined whether it was assessed through open- or closed-book testing. Police officers have clearly delineated procedures and protocols for numerous situations. Some are obscure and/or not very important and need not be committed to memory; others are critical and/or occur regularly and, as such, must be committed to memory. The areas of knowledge requiring recall that were identified as most important were administrative practices, state and federal criminal codes, and personnel supervisory practices. For the sergeant technical knowledge test, items were generated by command-level personnel within the Merit System, with guidance and input from consultants of the external firm. The items were then reviewed by other members of the police command staff, testing specialists from the consulting firm, and a linguistic specialist (who reviewed the items for potential biases against protected classes). The final version of the technical knowledge examination consisted of two parts and 103 items: a closed- and an open-book section, with 59 and 44 items, respectively.
The technical knowledge test and the in-basket/work sample exercise were completed by 391 applicants over a two-day period in December of 1999. Three hundred and one applicants returned to complete the oral board component in early 2000. After all assessments were scored and a potential hiring list was reviewed, analyses revealed African Americans would be adversely impacted if selections were based on rank-ordered results. Plaintiff parties objected to the results of the examination, the judge ruled in their favor, and no selections were made from the established promotional list. A Court Order was later directed, in May 2001, to redesign and re-administer the examination in accordance with agreed-upon modifications. The consultants contracted to create the 1999 selection process and assessments were retained to complete an updated job analysis to reestablish the tests' content domain and to provide opportunities for the plaintiffs, Department of Justice representatives, and other involved parties (e.g., the court-appointed Special Master) to raise objections to material and courses of action during the test (re)design process. The job analysis revealed the duties performed by officers holding the rank of sergeant had not greatly changed since the 1999 analysis. As a result, the examination administered in September 2001 was essentially a replication of the 1999 examination. The three test components used in the 2001 administration were exactly the same as those used in the 1999 administration in form (i.e., a multiple-choice technical knowledge test, an in-basket/work sample, and an oral board) and function (i.e., presentation and processes followed the same guidelines). The areas of knowledge identified as most critical to performing the duties associated with the position were the same as those for the 1999 administration (criminal code, administrative practices, and personnel supervision practices), with the addition of constitutional law. Once again, technical knowledge was assessed with a two-part written test, open- and closed-book. Only 23 items appeared on the 2001 closed-book technical knowledge test. The presentation of these items on the 2001 administration was nearly identical to their original form on the 1999 administration (see Appendix A for examples of changes).

Three hundred and sixteen candidates completed the open- and closed-book tests of technical knowledge in late September 2001. Of those 316 candidates, 171 had also completed the 1999 technical knowledge test. This group constitutes the sample for the current study. One hundred and three individuals classified themselves as White/Non-Hispanic, while 68 stated they were African American, and the group was predominantly male (N = 148). Of the 23 items that appeared on the 2001 administration of the Technical Knowledge Written Test, 14 had also appeared on the 1999 administration. These 14 items constitute the composite that serves as the main focus of the present study's analyses.

Analyses

Based upon the history of the testing process and the fact that only the items from the 2001 administration were used for hiring decisions, the results from this group of items will be the primary focus of analysis. Using the sample and composite described above, coefficient alpha (Equation 4), as it is traditionally calculated, will be computed for the 2001 test to provide a base point of internal consistency.
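For reference, a minimal Python sketch of the traditional calculation, assuming Equation 4 is the familiar variance-ratio form of alpha (the function name is the writer's own):

import numpy as np

def cronbach_alpha(items):
    # items: an n x k array of item scores (rows are examinees).
    n, k = items.shape
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)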
In addition, a test-retest (Pearson) correlation will be produced to indicate the temporal stability of the scores from the first to the second administration. Green's (2003) test-retest alpha (Equation 9) will then be calculated and compared to these statistics. The presence of transient error will be exposed if the traditionally calculated alpha (Equation 4) for the 2001 administration is greater than test-retest alpha (Equation 9). Following these analyses, alpha's standard error will be calculated according to Duhachek and Iacobucci's (2004) method, discussed above (Equation 11). Confidence intervals based on ASE (Equation 12) will be produced, and the location of the various estimates will be presented in relation to the interval.

The upper and lower boundaries of alpha's confidence interval will then be substituted in place of the traditional point estimate in several calculations commonly utilized within the field of personnel selection. First, the effects of increasing the test length using both the alpha point estimate and α_LB in the Spearman-Brown prophecy formula (Equation 13) will be compared. Derivations of the formula will be presented, with comparisons drawn between the results produced by the two statistics. Second, the observed correlation between the technical knowledge composite and each of the other selection measures will be calculated. The observed correlations will be corrected for attenuation due to measurement error (Equation 14) using both traditional alpha and α_LB. Additionally, the upper boundary of alpha's confidence interval (α_UB) will be substituted in the equation, as the use of this statistic is likely to produce the most conservative estimate. The resulting correlations will be presented and compared. Finally, comparisons of the results of applying both the point and lower bound estimates of alpha to Cascio et al.'s (1991) SED banding formula will be presented (Equations 7 and 8, respectively). Differences between bandwidths and the resultant probabilities of minority selections will be demonstrated. Operating under the assumption that within-band selections will be made by a (hypothetical) secondary selection device neutral with respect to its effect on racial group membership (i.e., selection rates are random, based on the proportion of each race within the sample), adverse impact analyses will be conducted for each approach (top-down, SED banding, SED α_LB banding) and compared.

RESULTS

Coefficient alpha for the 1999 test was calculated to be .333, while coefficient alpha for the 2001 administration was .373. While the level of internal reliability was relatively consistent between the administrations, this does not suggest the scores are highly correlated. In fact the correlation coefficient, which shows the temporal consistency between scores on the two administrations, was only moderate (r = .436). Table 6 displays the descriptive statistics for the two tests, while Appendix B presents the covariance matrices for the 1999, 2001, and combined 1999/2001 data.

Table 6
Descriptive Statistics for 1999 and 2001 Administrations

Administration   N     Minimum   Maximum   Mean    Std. Dev.   Alpha
1999             171   5.00      14.00     10.16   1.84        .333
2001             171   7.00      14.00     11.50   1.58        .373

Test-retest Alpha

Green's test-retest alpha was calculated following Equation 9, using the average between-test, different-item covariance and the standard deviations of the two total scores:

\hat{\alpha}_{X_1X_2} = \frac{k^2\,\bar{\sigma}_{x_{j1}x_{j'2}}}{\sigma_{X_1}\sigma_{X_2}} = \frac{14^2(.0053)}{(1.84)(1.58)} = .357

Since the test-retest alpha is lower than the traditional alpha calculated for the 2001 test, the presence of transient error could be a factor.
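A hedged Python sketch of this computation, mirroring only the arithmetic shown above (Green's 2003 article gives the definitive form of Equation 9; the function and variable names are the writer's own):

import numpy as np

def test_retest_alpha(x1, x2):
    # x1, x2: n x k arrays of item scores from the two administrations.
    n, k = x1.shape
    cross = np.cov(np.hstack([x1, x2]), rowvar=False)[:k, k:]   # between-test block
    mean_cov = cross[~np.eye(k, dtype=bool)].mean()   # different-item covariances only
    s1 = x1.sum(axis=1).std(ddof=1)                   # total-score standard deviations
    s2 = x2.sum(axis=1).std(ddof=1)
    return k**2 * mean_cov / (s1 * s2)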
Though the difference between the statistics is not large, the degree to which transient error affects the alpha calculation would never have been known had the 2001 administration not resulted from the successful challenge to the 1999 results. As such, in order to make the most conservative judgment about the test's reliability, and the most cautious application of the statistic in other formulae, a confidence interval around the point estimate was produced.

Alpha's Standard Error and Confidence Interval

Alpha's Standard Error (ASE) was calculated to equal .07, following the syntax provided by Duhachek and Iacobucci (2004; located in Appendix C). Using a 95% confidence interval, alpha's lower boundary (α_LB) was calculated to equal .236 and alpha's upper boundary (α_UB) was calculated to equal .510. Figure 2 clearly shows the confidence interval encapsulates the various reliability estimates.

Figure 2. Confidence Interval and Other Reliability Estimates: α_LB = .236, test-retest α = .357, α = .373, test-retest r = .436, α_UB = .510.

Alpha and the Spearman-Brown Prophecy Formula

In cases where reliability levels are less than desirable (such as in the current study) and the researcher would like to know the extent to which increasing the number of items on the measure would improve reliability, the Spearman-Brown prophecy formula can be a useful tool. Table 7 presents a comparison between the estimated effects of doubling the test length (i.e., adding 14 items to the composite investigated in the current study) using both traditional alpha and α_LB in the SBPF.

Table 7
SBPF Using Traditional α and α_LB

SBPF using traditional α:  New α = 2(.373) / (1 + (2 − 1)(.373)) = .54
SBPF using α_LB:           New α = 2(.236) / (1 + (2 − 1)(.236)) = .38

As can be seen, applying the SBPF using alpha as it is traditionally calculated yields an estimate nearly 1.5 times that of the estimate using α_LB (.54 versus .38, respectively). Yet both calculations produce less than desirable levels of internal consistency, so algebraically manipulating the SBPF to determine the number of items necessary to reach such a level, say .80, can be more informative (see Table 8).

Table 8
SBPF Using Traditional α and α_LB (solving for i)

Using traditional α:  i = α_new(1 − α) / (α(1 − α_new)) = .80(1 − .373) / (.373(1 − .80)) = 6.72
Using α_LB:           i = α_new(1 − α_LB) / (α_LB(1 − α_new)) = .80(1 − .236) / (.236(1 − .80)) = 12.95

Applying alpha as it is traditionally calculated suggests that approximately 94 items [6.72 x 14 (the original number of items)] would need to be included in order to reach the .80 threshold, while using α_LB estimates over 180 items are needed to reach that level of internal consistency. Since it is often unrealistic to create a test with so many items, a final calculation is worthwhile to project what the "worst case scenario" of using 94 items might be. Ninety-four items is approximately 6.72 times the number that originally made up the composite, so applying this factor to the SBPF using α_LB projects .67 as the likely ("worst case") level of reliability should the developer decide to use that many items:

\text{New } \alpha = \frac{6.72(.236)}{1 + (6.72 - 1)(.236)} = .67
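The rearrangement used in Table 8 follows from solving Equation 13 for i; spelled out:

\rho_{new} = \frac{i\rho}{1 + (i-1)\rho} \;\Rightarrow\; \rho_{new}\bigl(1 + (i-1)\rho\bigr) = i\rho \;\Rightarrow\; \rho_{new}(1 - \rho) = i\rho(1 - \rho_{new}) \;\Rightarrow\; i = \frac{\rho_{new}(1-\rho)}{\rho(1-\rho_{new})}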
Alpha and Correction for Attenuation

Another calculation that uses coefficient alpha is the correction for attenuation formula (Equation 14). As mentioned, as part of the promotional testing procedure candidates completed two assessments in addition to the technical knowledge written test: a structured oral interview and an in-basket/role play exercise. The correlation between the 2001 composite and the latter exercise was .203 (p < .001), while the correlation with the former was not significant (r = .015, p = .848). Therefore only the relation between the 2001 composite and the in-basket/role play will be investigated. The alpha level of the exercise (α_IB) will remain constant throughout this example (though corrections for this variable could be appropriate as well), while the upper and lower boundaries of the confidence interval will take the place of the alpha coefficient for the technical knowledge written test (α_WT). (While candidates' scores on the in-basket/role play exercise were available, their component scores on the exercise were not; the .67 alpha level comes from the technical manual and represents the whole-sample internal consistency of the exercise.)

Table 9
Correction for Attenuation Using Traditional α, α_LB, and α_UB

Using traditional α:  r_12 / √(α_IB α_WT) = .203 / √(.67(.373)) = .41
Using α_LB:           r_12 / √(α_IB α_LB) = .203 / √(.67(.236)) = .51
Using α_UB:           r_12 / √(α_IB α_UB) = .203 / √(.67(.510)) = .35

Table 9 shows that correcting the coefficient for attenuation due to measurement error in the composite produces a true correlation of .41 using (traditional) alpha. Substituting α_LB instead suggests the linear relation between the two variables could be as high as ρ = .51, while inserting α_UB in place of coefficient alpha estimates that the variables are only linearly related at ρ = .35 after correcting for measurement error.

Alpha and Test Score Banding

Appendix D presents the actual list of candidates, their rank-ordered placement based upon the 2001 composite, and their race. A quick glance at the table shows that a disproportionate number of White candidates are at the top of the distribution. When such results are encountered the SED banding technique may help alleviate the degree of adverse impact associated with the selection process. Table 10 presents the calculation of SED bands using both traditional alpha and α_LB.

Table 10
SED Bands Using α and α_LB

SED band:       1.96·√2·σ_x(1 − α̂) = 2.77(1.58)(.627) = 2.74; true band 14.00-11.26; applicable band 14.00-12.00
SED α_LB band:  1.96·√2·σ_x(1 − α̂_LB) = 2.77(1.58)(.764) = 3.35; true band 14.00-10.65; applicable band 14.00-11.00

An SED band using coefficient alpha as it is traditionally calculated creates a range of indifference spanning 2.74 points, which translates to a lower limit of 11.26. Since all scores are whole numbers, the band includes all scores of 12 and higher, while scores of 11 and lower fall outside of the band. Substituting α_LB in place of traditional alpha results in a wider bandwidth. The SED α_LB band is equal to 3.35 points, which translates to a lower limit of 10.65. In this case all of those who achieved an 11 or greater on the composite are included in the band, while those who scored a 10 or lower will not have the opportunity to be selected.
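A short Python sketch mirroring the Table 10 arithmetic (the canonical form of the band is given by Cascio et al.'s Equation 8; the expression below simply reproduces the figures shown above, and the function name is the writer's own):

import math

def sed_bandwidth(sd, reliability):
    # Width of the range of indifference extending down from the top score.
    return 1.96 * math.sqrt(2) * sd * (1 - reliability)

print(sed_bandwidth(1.58, 0.373))   # traditional alpha: ~2.74
print(sed_bandwidth(1.58, 0.236))   # alpha lower bound: ~3.35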
For the purposes of the following example, assume a secondary assessment device is employed to make within-band selections that are random with respect to race. Table 11 presents the likely number of candidates that would be selected from each racial group at small (i.e., 7.5%), medium (i.e., 30%), and large (i.e., 55%) selection ratios according to the top-down, SED banding, and SED α_LB banding selection approaches.

Table 11
Racial Composition of Selected Test-takers by Selection Ratio

Selection Ratio   Top-down (W / AA)   SED Band (W / AA)   SED α_LB Band (W / AA)
7.5%              12 / 1              9.5 / 3.5           9 / 4
30%               42 / 9              38 / 13             34.5 / 16.5
55%               71 / 24             71 / 24             64.5 / 30.5

Note. W = Whites; AA = African Americans.

The level of adverse impact will be contingent upon the approach that is followed. Table 12 presents routinely calculated adverse impact statistics (4/5ths Rule calculations and the results of Fisher's exact tests) for each of the three methods. As can be seen, both banding techniques greatly reduce the level of adverse impact associated with the test, though it is still present in many cases. However, substituting α_LB in the SED equation creates wider bands that capture more African American candidates, which increases the opportunity of selection and lowers the degree of adverse impact associated with the selection process. While the 4/5ths Rule is still violated in every instance, the results of the Fisher's exact tests reveal differences among the three techniques. The differences in selection rates are not significant at the 7.5% selection ratio using either the traditional SED or the SED α_LB bands, but the SED α_LB band is the only technique that does not produce statistically significant differences at the 30% ratio (and at the 55% selection ratio its standard normal deviate is only one hundredth of a point over the adverse impact threshold of 1.96).

Table 12
Adverse Impact Calculations by Selection Ratio

Selection Ratio   Top-down (4/5ths, z)   SED Band (4/5ths, z)   SED α_LB Band (4/5ths, z)
7.5%              13%, -2.30             45%, -.98              67%, -.38
30%               32%, -3.80             52%, -2.35             69%, -1.30
55%               51%, -4.20             51%, -4.20             73%, -1.97

Note. z = standard normal equivalent.
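A hedged Python sketch of the two screens reported in Table 12: the 4/5ths ratio and a Fisher's exact test on the 2 x 2 selection table. Note that scipy reports an exact p-value rather than the standard normal equivalents shown above; the function names are the writer's own, and the example numbers are the top-down selections at the 7.5% ratio from Table 11.

from scipy.stats import fisher_exact

def four_fifths_ratio(sel_min, n_min, sel_maj, n_maj):
    # Minority selection rate divided by majority selection rate.
    return (sel_min / n_min) / (sel_maj / n_maj)

def selection_rate_test(sel_min, n_min, sel_maj, n_maj):
    # 2 x 2 table: selected / not selected, by group.
    table = [[sel_min, n_min - sel_min], [sel_maj, n_maj - sel_maj]]
    return fisher_exact(table)          # (odds ratio, two-sided p-value)

print(four_fifths_ratio(1, 68, 12, 103))    # ~.13, violating the 4/5ths rule
print(selection_rate_test(1, 68, 12, 103))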
DISCUSSION

Recent research has led to the reexamination of long-held assumptions regarding the interpretation of the most commonly presented reliability estimate used in personnel selection research and practice, Cronbach's (1951) coefficient alpha. The present study was designed to connect a number of recent advances in this field of research and to offer personnel practitioners insight into the ways these advancements may be applied to their practice. The main area of investigation centered on the effect of transient error, response variations that are due to random changes in test-takers' psychological states across time. A major assumption underlying the alpha coefficient is that it represents the lower boundary of reliability for the results of a measure. Yet recent theoretical (Komaroff, 1997) and empirical (Becker, 2000) evidence has demonstrated that if the errors associated with a measure's items are correlated, the reliability coefficient produced using the alpha calculation can be an inflated estimate of reliability. Transient errors, which are likely to be present in a wide range of testing situations, have the potential to create such a violation. The current study presents the results of a unique data set, where an actual selection process yielded the information necessary to identify the influence of transient error. The data also provided the opportunity to demonstrate methods that can protect personnel selection processes against the potential difficulties that result from the influence of transient error.

Test-retest alpha

Several models have been proposed to detect the likely presence of transient error (Becker, 2000; Schmidt et al., 2003). The current research implemented the test-retest alpha statistic, recently presented by Green (2003), which can reveal the presence of transient error when compared to the traditionally calculated alpha statistic. Because test-retest alpha takes into account not only within-test/different-item covariances (like traditionally calculated alpha) and between-test/same-item covariances (like the test-retest correlation) but also the between-test/different-item covariances, this statistic captures all the relevant sources of error assessed through classical test theory. While measures of internal consistency (such as coefficient alpha) can capture variance associated with random response error and specific factor error, and measures of stability (such as the test-retest correlation) can account for random response error and transient error, test-retest alpha can account for the variance from all three sources. Transient error, the only source of error not accounted for by alpha, can be identified by subtracting test-retest alpha from alpha. When alpha is larger than test-retest alpha, transient error is likely to be inflating the reliability estimate. When alpha is smaller than or equal to test-retest alpha, transient error is likely to have a negligible effect on the estimate.
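The division of labor among the three estimates can be summarized compactly (the notation below is the writer's shorthand for the error sources named above, not Green's):

\alpha:\ \sigma^2_{RRE} + \sigma^2_{SFE} \qquad r_{12}:\ \sigma^2_{RRE} + \sigma^2_{TE} \qquad \alpha_{X_1X_2}:\ \sigma^2_{RRE} + \sigma^2_{SFE} + \sigma^2_{TE}

so the inflation attributable to transient error is gauged by \alpha - \alpha_{X_1X_2} = .373 - .357 = .016 in the present data.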
Using data from candidates who completed the technical knowledge portion of a promotional examination for the rank of sergeant in both 1999 and 2001, test-retest alpha was calculated to be .357. Comparing this statistic to the alpha calculation for the 2001 composite (.373) and the test-retest correlation between the 1999 and 2001 administrations (.436) reveals test-retest alpha is the lowest estimate among the three. This suggests transient error may have inflated alpha as an estimate of reliability for the 2001 administration. Though the example provided in the current study demonstrates transient error can be a factor that affects the calculation of coefficient alpha, the difference is small (.016). While it might be easy to dismiss such a small discrepancy as an insignificant factor that would not impact the way the test is viewed, this is only a single sample and should not suggest that transient error is innocuous. There is no particular value at which inflation will impact the interpretation of test results: in some cases a few hundredths of difference might be influential, while in others a couple of tenths could have no practical effect. The lack of concrete guidelines to gauge the influence of transient error may discourage practitioners who wish to control for its effects, but this problem pales beside the fact that this source of error can almost never be identified in most testing situations. Practitioners who create and analyze the results of tests from a single administration will not have the data to generate test-retest alpha, or any other statistic that can be used for similar purposes (e.g., Becker, 2000; Schmidt et al., 2003), thus the extent to which alpha is being over- or underestimated cannot usually be known. To compensate for this difficulty the present study calls for the confidence interval to be presented and used more regularly in personnel selection research and practice.

Confidence interval alpha

In personnel selection contexts the validation of tests is often demonstrated through a content-based approach, where documentation of incumbent subject matter expert ratings of the test material serves as evidence that the assessment is appropriate. This is in contrast to the statistical information that is produced following construct and/or criterion-related validation approaches. The absence of additional statistical information places a greater weight on the reliability estimate as a diagnostic instrument. This lone statistic could be the only factor used to interpret candidates' scores and make selection decisions. With so much emphasis placed on a single estimate, it is imperative that testing professionals communicate the degree of precision associated with the calculation. While the alpha point estimate is often the best single estimate to consider when interpreting the reliability of results from a test, additional diagnostic information is readily available to be presented along with the statistic. The present study echoes the call of recent researchers (e.g., Duhachek & Iacobucci, 2004) to supplement the alpha point estimate with additional information, such as the standard error of the calculation and a confidence interval. Adding this information generates a visible reminder that coefficient alpha is not a perfect indicator of test scores' true reliability. In the current study the upper and lower boundaries of alpha were computed using the statistic and formula developed by Duhachek and Iacobucci (2004). Alpha's Standard Error (ASE), which was found to equal .07, is based upon the distribution of standard error surrounding coefficient alpha and serves as the basis for creating the confidence interval. The resultant confidence interval, which ranges from .236 (α_LB) to .510 (α_UB), suggests that if 100 samples were taken of this composite, the alpha coefficient would be calculated to fall within those upper and lower boundaries 95 times out of 100. This information can not only be influential in decision-makers' interpretation of test results but can also help inform personnel practitioners in creating and utilizing the tests they develop.
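As a worked check, the boundaries follow directly from Equation 12 with the rounded ASE:

\hat{\alpha} \pm 1.96 \cdot ASE = .373 \pm 1.96(.07) \approx .373 \pm .137 \Rightarrow (.236,\ .510)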
Utilizing the Confidence Interval

Beyond its use as a measure of the level of internal consistency, coefficient alpha also serves as the anchor statistic in a variety of formulae used to project, interpret, and apply test results. If the estimate is inaccurate, so too are the results of whatever formula uses the estimate in its calculation, which in turn could lead to inaccurate interpretations and applications. In high stakes settings, such as personnel selection, it is prudent to err on the side of caution. If the possibility exists that alpha overestimates the reliability of a set of results because of the effect of transient error, and calculations exist to formulate a more conservative statistic, caution can and should be exercised. The present study demonstrates the benefits of substituting the upper and lower boundaries of alpha's confidence interval in place of the point estimate in a variety of calculations common to personnel selection.

First, the Spearman-Brown prophecy formula, which informs researchers of the extent to which increasing the number of items on their measure would improve reliability, was used as an example to demonstrate the effects of substituting α_LB in place of alpha as it is traditionally calculated. The results were quite informative. As Table 7 presents, when projecting the reliability level that results from doubling the current number of items on the composite, applying the SBPF using alpha as it is traditionally calculated yields an estimate nearly 1.5 times that of the estimate using α_LB. Applying alpha as it is traditionally calculated would suggest that doubling the test length from 14 to 28 items would improve reliability from .37 to .54 (a 46% increase), while substituting α_LB in the equation suggests the improvement would be one hundredth of a point, from .37 to .38 (less than a 3% increase). Essentially, the result of the α_LB substitution suggests the test length would need to be doubled merely to ensure the new reliability level will not be less than the original point estimate.

Manipulating the SBPF to calculate the number of items necessary to reach an "acceptable" level of reliability, in the present case .80, also reveals a marked difference between alpha and α_LB. Using alpha as it is traditionally calculated suggests approximately 94 items would need to be included in order to reach the .80 threshold, while using α_LB estimates over 180 items would be needed to reach that level of internal consistency. Since it is most likely unrealistic to create a test with so many items, it was demonstrated that other derivations of the SBPF can utilize the α_LB substitution to provide additional information, such as what the "worst case scenario" of using a certain number of items on a test might be. Using the current example, 94 items was determined to be the number necessary to reach an alpha level of .80 when alpha as it is traditionally calculated was applied to the formula. Since 94 items is 6.72 times the original number, applying this factor to the SBPF using α_LB estimates .67 as the projected level of reliability. A researcher who conducted such calculations could then be reasonably sure the alpha level that results from including 94 items on the test will not be below .67, though it is more likely to be near .80.

The results of the current study demonstrate that the discrepancy between results of the SBPF using alpha as it is traditionally calculated and α_LB can be significant, but whether the SBPF over- or underestimates the new reliability level will depend on which assumptions underlying the alpha calculation are violated (essential τ-equivalence and/or uncorrelated errors). Of course, finding a higher than expected level of reliability is not undesirable, but a lower than expected result could have severe implications for test development planning. For example, a researcher could develop a test, conduct a pilot study, calculate the alpha level of the results, perform the SBPF calculation to determine the effect of doubling the number of items that appeared on the pilot version, create the new items, re-administer the test, and find a completely different reliability level than predicted by the traditional formula.

To demonstrate the practical effect of overestimating reliability predictions, consider the following. The test-retest data used in the present study were the product of a successful court challenge to the results of the 1999 administration. The consequence was a hiring freeze in which no officers were promoted to the rank of sergeant after the 1999 test. The two-year period during which no promotions were made undoubtedly placed a greater strain on the existing group of sergeants.
The effect of the additional strain could have led to a decrease in the sergeants' performance, which in turn could have led to decreases in arrests and convictions and/or increases in crime. Although it was not so in the present case, an unacceptable level of reliability can serve as grounds for challenging the results of a selection process. If one of the challenges brought against the results of the 1999 test had been a low level of reliability, and the researcher had increased the number of items on the 2001 administration to reach a certain level of internal consistency based upon the SBPF using alpha as it is traditionally calculated, an unexpectedly low alpha level could once again have been obtained, providing additional grounds for challenge.

Although operating under a court-ordered consent decree is an atypical situation for most organizations, and serious consequences such as increased crime rates might seem far removed from the calculation of a reliability estimate, the degrees of separation between imprecise calculations and real-world negative effects are actually quite small. There is a path that begins with test interpretations and ends with customer impressions. A reliability estimate informs users of a selection process's value; the selection process informs organizations of the extent to which the candidates they choose will perform effectively; and the candidates who are hired to perform for the organization determine the worth of the products or services provided to the public. Most business models will plot a straight course toward the goal of providing customers with products or services that are well received, but if the first step is askew the path will be off-mark. Since no models, tests, or statistics are perfectly accurate, a conservative approach should be adopted to guard against over-projections (again, exceeding projections is not as undesirable). Test development procedures based upon cautious calculations, such as substituting α_LB in the SBPF, allow personnel practitioners to protect the organizations they serve against less than desirable test results, which will help assure the quality of all subsequent decisions and successive outcomes.

Correction for Attenuation

The second example of utilizing the boundaries of alpha's confidence interval focused on the role of the reliability coefficient in correcting correlation coefficients for attenuation due to measurement error. Only the results from the in-basket/work sample exercise produced a significant correlation with the composite, though it was a low correlation (.203). Table 9 presents the results of correcting the correlation coefficient using alpha, α_LB, and α_UB and, once again, the results are notable. Using alpha as it is traditionally calculated resulted in a revised estimate of the correlation equaling .41, while substituting α_LB produced a coefficient that was approximately 25% higher (.51). While substituting α_LB could be warranted if the researchers knew that transient error was inflating the estimate, as mentioned, it is rather unusual to have the test-retest data necessary to make this determination. If α_LB is used without this information, the "corrected" correlation coefficient could be a severely inflated estimate. A more cautious approach would be to instead apply α_UB. In the current example, the corrected correlation coefficient using α_UB was calculated to equal .35. Obtaining all three estimates, the conservative estimate using α_UB
(.35), the primary figure using the alpha point estimate (.41), and the liberal estimate using α_LB (.51), would of course be the most informative. The additional information provided by substituting the confidence interval boundaries in place of the point estimate when correcting correlation coefficients for attenuation due to measurement error could be valuable for test developers engaged in establishing the construct validity of a new measure, for researchers comparing results among studies, and for authors presenting results of newly investigated relations among variables. The purpose of introducing this example, as well as the SBPF example, is to demonstrate that the lack of precision in reliability estimates should not remain a merely theoretical interest. Reliability estimates are included in a great number of commonly applied formulae, which affect the ways tests are created and utilized and the ways their results are interpreted. The final area of investigation may best demonstrate the degree to which reliability estimates can exert salient effects on personnel selection decisions.

SED Banding

The topic of SED banding has a very divisive history. Though a number of strong supporters have tirelessly endorsed the practice in the professional literature, and there is a near-consensus opinion that the creation of bands in and of itself (that is, without discussion of band width or the ways within-band selections are made) is a legally defensible practice, a strong group of critics also exists who challenge the use of the technique. Many of the criticisms are directed at the theoretical and logical grounds that serve as the technique's basis. At the core of those grounds is the argument that it is reasonable to create ranges of indifference among scores because no test is a perfect measurement device; the proof of this assertion is the fact that reliability coefficients are almost never equal to 1.00. While this fact provides a very strong argument for the general strategy of banding, the specific mechanics of the approach, i.e., determining a range of indifference, must also be defended.

As Equation 8 demonstrates, along with the standard deviation of the test scores, the bandwidth will differ from test to test as larger and smaller reliability coefficients, which are routinely coefficient alpha, are inserted into the formula. If the purpose of creating the band is to assure that the scores lying outside the band are significantly different (based upon a predetermined level of confidence) from those within the band, the precision of the alpha coefficient is paramount. If the alpha coefficient is an underestimate, the bandwidth will be larger than it needs to be; thus, some of the lowest scores incorporated within the band should actually be left outside of it. If the alpha coefficient is an overestimate, the bandwidth will not be large enough; in this case, some of the highest scores that lie just outside of the band should actually be incorporated within it. Incorrectly stating that a score outside of the band is significantly different from the top score within the band is similar to committing a Type I error in research, while including a score within the band that should actually remain on the other side is similar to committing a Type II error. The former will occur when an alpha coefficient that overestimates reliability is included in the SED band equation.
In such cases individuals who lie just outside of the band are treated as being significantly different from the referent score (similar to rejecting H0), though they are not statistically significantly different based upon the prescribed confidence level. Conversely, applying an alpha coefficient that underestimates reliability is comparable to a Type II error. Here some of the individuals within the band are significantly different from the referent scorer, though they are treated as being equivalent (similar to retaining H0).

In personnel selection settings, just as in any other, a choice must be made whether to favor committing a Type I or a Type II error. The decision to use a potentially wider than needed band (risking a Type II error) over a potentially narrower band (risking a Type I error) should be heavily influenced by the repercussions of each choice. For instance, while the owner of a new retail store may comfortably commit a Type II error when selecting security guards, because the greatest repercussion will likely be small amounts of shoplifting or loitering teenagers, the owner of a new sightseeing service should not feel as comfortable committing this type of error when selecting pilots to carry clients, because the repercussions in this case include loss of life and extremely expensive equipment. Since most scenarios lie somewhere between these extreme examples, the choices are usually more difficult, and how well employees perform their job is only one criterion to consider when making a hiring or promotion decision. There are a host of other considerations that should be weighed, such as employee development and career progression, public image and relations, as well as diversity and affirmative action goals.

Generally, creating wider bandwidths provides greater opportunity for minority selections, though this depends upon the degree to which the secondary selection device (used for within-band selections) affects the selection rates of minority groups. The present study demonstrates that when a secondary selection device has a random effect on racial selections, the wider the band, the greater the opportunity for minority selections. While using an SED band as it is traditionally calculated (using the alpha point estimate) would result in a marked increase in minority selections over strict top-down selections (particularly with smaller selection ratios), expanding the band by substituting α_LB in place of the alpha point estimate creates even greater opportunity. Table 12 demonstrates the impact these techniques have on adverse impact calculations. While the 4/5ths Rule remains violated following every technique at every selection ratio, the proportion of minority to majority selections greatly improves with the creation of traditional SED bands, and to an even greater extent using SED α_LB bands. Looking at the 30% selection ratio, the 4/5ths ratio using top-down selection is 32%, but increases to 52% after applying the SED band and to 69% after applying the SED α_LB band (roughly 1.6 and 2.2 times the top-down figure, respectively). The results are even more dramatic for the smaller selection ratio (7.5%), where a 13% minority-to-majority selection ratio exists following top-down selections. In this case, using an SED band improves that ratio to roughly 3.5 times the top-down figure (4/5ths calculation = 45%), while an SED α_LB band improves upon the top-down ratio more than fivefold (4/5ths calculation = 67%).
As 4/5ths Rule calculations are heavily dependent upon sample size, best professional practice calls for a statistical significance test to supplement the figure. In the present study Fisher's exact test was employed to test the null hypothesis that no significant differences exist between the selection rates of minority and majority applicants. Table 12 presents the results of these tests as well. When the standard normal equivalent exceeds 1.96 in absolute value, the selection rates of the two groups are considered to be statistically significantly different. As can be seen, top-down selection at the 7.5% selection ratio produces a result that crosses this threshold, demonstrating a statistically significant difference between the selection rates of African Americans and Whites. When either the SED band or the SED α_LB band is applied, the standard normal equivalent drops below one in magnitude, revealing that the difference between selection rates for the two groups could be due to chance. At the 7.5% selection ratio, both the SED and SED α_LB bands would be successful means to combat adverse impact (as detected by Fisher's exact test). However, at the 30% selection ratio only the SED α_LB band would be effective in this manner. At the 55% ratio, note that the SED band produces a result equal to the top-down procedure, a statistically significant standard normal equivalent of -4.20, while the SED α_LB band just barely crosses the 1.96 threshold. The results of the present study clearly demonstrate that substituting the lower bound alpha in place of the alpha point estimate expands SED bands to a point where a substantive improvement can be observed when assessing adverse impact.

While the α_LB substitution in the SED banding formula can be seen as one of many ways the confidence interval created around alpha can be utilized by personnel practitioners, it is quite different from most other examples and thus deserves further discussion. Unlike the correction for attenuation and Spearman-Brown prophecy formula examples, where the α_LB substitution can be performed solely for informational purposes, applying the SED banding modification proposed here will very often have immediate real-world effects (e.g., some individuals will be given the opportunity for promotion while others will not). Given the amount of controversy that surrounds the SED banding technique in general, the question of how well this modification will be accepted by the community of personnel practitioners should be addressed.

Judging the Use of α_LB as a Professional Practice

There are three main criteria for acceptance upon which the approach will likely be judged. The first and foremost issue to be weighed is the logic underlying the modification. Is it reasonable to substitute the lower boundary of the confidence interval surrounding alpha in place of the point estimate? Based upon recent research demonstrating that transient errors can cause a violation of the uncorrelated errors assumption and lead to an inflated reliability estimate (via coefficient alpha) in almost any testing situation, the modification to the SED banding technique is not only intuitively reasonable but fully in line with the logic and purpose of applying the technique. SED bands are formed to ensure that individuals with the requisite knowledge, skills, and abilities for successfully performing the duties of the position being tested are not passed over due to the presence of measurement error in the instrument used to make the assessment.
Due to the manner in which the bandwidth is calculated (i.e., using variability and reliability as the major determinants), if alpha were overestimated the bands would be narrower than intended. This outcome may in turn create a situation wherein individuals with true levels of competency equal to that of the individual with the highest score on the test are not given the opportunity to be selected, not because of their true levels of knowledge, skills, or abilities but due to error in measuring those traits. The use of α_LB helps make certain this does not occur by creating a band with confidence that the reliability estimate used in the equation is not inflated.

The second criterion against which practitioners will likely weigh the modification is the efficacy of the technique. While it is reasonable to use the revision on the bases of logic and reason, practitioners would likely only be interested in adopting it if it were demonstrated to assist organizations in improving their diversity and/or reducing adverse impact. The present research supports the notion that SED α_LB bands can indeed lead to these outcomes. Both the 4/5ths calculations and the results of the Fisher's exact tests demonstrate that SED α_LB bands produce greater opportunity for minority selection, which reduces the likelihood of adverse impact and increases the potential for a diverse organization. Of course, like all banding techniques, the utility of the selection procedure will decrease, though the modification will likely lead to a greater decrease than that resulting from traditionally calculated SED bands. As discussed, a decision must be made whether to favor the integrity of the selection procedures or the advancement of diversity goals. If an organization is determined to choose the latter, SED α_LB banding represents a logically sound alternative that can improve the probability of protected class selections. However, future research should investigate the tradeoff in utility between traditionally calculated SED bands and the proposed modification so that practitioners can be fully informed when assisting an organization with the decision.

The third and final criterion is unfortunately the most difficult to assess: the likelihood that the modification will be accepted by the courts. An in-depth discussion of this matter is beyond the scope of the present work, but as the acceptability of the modification is inherently linked to the acceptability of the SED banding technique in general, previously published literature covering this topic may be instructive (e.g., Barrett & Lueke, 2004). Since the approach has a greater likelihood of assisting minority groups, discrimination lawsuits initiated by protected classes in opposition to the technique would be less likely to arise than those following top-down selection, or even those following traditionally calculated bands. However, the chance that charges of reverse discrimination are filed by White applicants with the top rank-ordered scores could increase beyond that encountered using either top-down selection or traditionally calculated SED bands; the wider the band, the greater the potential for this type of challenge. Therefore, Guion's (2004) advice about determining the width of the range of indifference using reason that can be "articulated well enough to be made a matter of written record" (p. 56) becomes a critical issue. Fortunately, the argument for using SED α_LB banding is uncomplicated.
There are no perfect tests; all contain measurement error that affects the interpretation of results, especially differences among scores. In a context as important as personnel selection, the accuracy of decisions to favor some individuals over others is crucial. If decision makers cannot be confident about the distinctions they would like to draw, creating a range of indifference, wherein all scores are considered equal, is a reasonable means of compensation. The SED banding technique calculates this range using the variability and reliability of the obtained results (which are at the root of the uncertainty in the results). In order to exercise the greatest amount of caution while drawing distinctions among applicants, the bandwidth is determined by employing the most conservative estimate of reliability that is available, α_LB. Though the technique could still be attacked for the same reasons traditional SED banding has been criticized (e.g., it does not represent the "best practice" for selection procedure utility), the modification should not introduce any new obstacles to judicial acceptance.

Conclusion

The overarching theme of the present research is conservatism. One of the reasons coefficient alpha has been so widely applied and accepted is the long-held notion that the calculation presents the most conservative estimate of reliability. However, in light of recent research demonstrating that factors exist (i.e., transient error) that can cause coefficient alpha to present an inflated estimate of reliability, professionals must reassess the meaning they assign to the venerable statistic and the influence it has on their practice. Three logical alternatives exist: (1) coefficient alpha can be abandoned for other reliability estimates that better account for these factors, (2) corrections can be made to the calculation to prevent the overestimation, and (3) compensatory techniques can be employed to offset potential shortcomings when applying and interpreting the statistic.
Though the first alternative is certainly a viable option, coefficient alpha is so widely used and recognized that its abandonment could have negative repercussions on a wide array of research and practice. For example, while reliability estimates following generalizability theory would be appropriate (and much more informative) in many personnel selection contexts, for better or worse a significant number of personnel practitioners are not familiar with the methods of conducting a g-study, and even fewer decision makers within the organizations they serve could well interpret the results' meaning. Unfortunately, the means to correct the calculation are usually unavailable, leaving the second alternative an unviable option as well. Test-retest alpha is an example of a "corrected" coefficient alpha, but the statistic can only be computed when test-retest data are available. The only practicable alternative in most cases is to compensate for the imprecision of the coefficient. The most direct way to accomplish this is to offer a confidence interval that surrounds the alpha point estimate. Presenting this supplemental information is the best means to affect decision makers' understanding of the statistic, thereby influencing interpretation of the test results they have obtained and the personnel decisions they must conclude. The upper and lower boundaries of alpha's confidence interval not only provide valuable information to be used in the interpretation of the statistic but are sound estimates of reliability in their own right, which researchers and practitioners can apply to the other formulae they commonly employ. The examples provided in the present study demonstrate the advantages of substituting the confidence interval boundaries in place of the point estimate. These advantages stem from practicing caution in application and being conservative in interpretation. Application of the social sciences, such as the design of personnel selection procedures, can never be conducted with the same degree of accuracy as the more concrete natural sciences. Yet, so long as the degree and sources of inaccuracy are never concealed, advancements will continue and shortcomings can be minimized.

REFERENCES

Americans with Disabilities Act of 1990, Pub. L. No. 101-336 (1990).

Barchard, K. A., & Hakstian, A. R. (1997). The robustness of confidence intervals for coefficient alpha under violation of the assumption of essential parallelism. Multivariate Behavioral Research, 32(2), 169-191.

Barrett, G. V., & Lueke, S. B. (2004). Legal and practical implications of banding for personnel selection. In H. Aguinis (Ed.), Test-score banding in human resource selection: Technical, legal, and societal issues. Westport, CT: Praeger Publishers/Greenwood Publishing Group.

Barrick, M. R., & Mount, M. K. (1995). The Personal Characteristics Inventory manual. Unpublished manuscript, University of Iowa, Iowa City.

Baugh, F. (2002). Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 254-263.

Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370-379.

Becker, G., & Cherny, S. S. (1994). Gender-controlled measures of socially desirable responding. Journal of Clinical Psychology, 50, 746-752.

Buss, A. H., & Perry, M. (1992). The Aggression Questionnaire. Journal of Personality and Social Psychology, 63, 452-459.

Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F., Murphy, K. R., & Guion, R. M. (2001). The controversy over score banding in personnel selection: Answers to 10 key questions. Personnel Psychology, 54, 149-185.

Cascio, W. F., Goldstein, I. L., & Outtz, J. (1995). Twenty issues and answers about sliding bands. Human Performance, 8, 227-242.

Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233-264.

Charter, R. A., & Feldt, L. S. (2002). The importance of reliability as it relates to true score confidence intervals. Measurement and Evaluation in Counseling and Development, 35, 104-112.

Civil Rights Act of 1991, 42 U.S.C. § 2000e-2(l).

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.

Crano, W. D., & Brewer, M. B. (1973). Principles of research in social psychology. New York: McGraw-Hill.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Harcourt Brace Jovanovich College Publishers.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Applications of the social sciences, such as the design of personnel selection procedures, can never be conducted with the same degree of accuracy as the more concrete natural sciences. Yet, so long as the degree and sources of inaccuracy are never concealed, advances will continue and shortcomings can be minimized.

REFERENCES

Americans with Disabilities Act of 1990, Pub. L. No. 101-336 (1990).

Barchard, K. A., & Hakstian, A. R. (1997). The robustness of confidence intervals for coefficient alpha under violation of the assumption of essential parallelism. Multivariate Behavioral Research, 32(2), 169-191.

Barrett, G. V., & Lueke, S. B. (2004). Legal and practical implications of banding for personnel selection. In H. Aguinis (Ed.), Test-score banding in human resource selection: Technical, legal, and societal issues. Westport, CT: Praeger Publishers/Greenwood Publishing Group.

Barrick, M. R., & Mount, M. K. (1995). The Personal Characteristics Inventory manual. Unpublished manuscript, University of Iowa, Iowa City.

Baugh, F. (2002). Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 254-263.

Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370-379.

Becker, G., & Cherny, S. S. (1994). Gender-controlled measures of socially desirable responding. Journal of Clinical Psychology, 50, 746-752.

Buss, A. H., & Perry, M. (1992). The Aggression Questionnaire. Journal of Personality and Social Psychology, 63, 452-459.

Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F., Murphy, K. R., & Guion, R. M. (2001). The controversy over score banding in personnel selection: Answers to 10 key questions. Personnel Psychology, 54, 149-185.

Cascio, W. F., Goldstein, I. L., & Outtz, J. (1995). Twenty issues and answers about sliding bands. Human Performance, 8, 227-242.

Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233-264.

Charter, R. A., & Feldt, L. S. (2002). The importance of reliability as it relates to true score confidence intervals. Measurement and Evaluation in Counseling and Development, 35, 104-112.

Civil Rights Act of 1991, 42 U.S.C. § 2000e-2(1).

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.

Crano, W. D., & Brewer, M. B. (1973). Principles of research in social psychology.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Harcourt Brace Jovanovich College Publishers.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Duhachek, A., Coughlan, A. T., & Iacobucci, D. (2005). Results on the standard error of the coefficient alpha index of reliability. Marketing Science, 24, 294-301.

Duhachek, A., & Iacobucci, D. (2004). Alpha's standard error (ASE): An accurate and precise confidence interval estimate. Journal of Applied Psychology, 89, 792-808.

Fan, X., & Thompson, B. (2001). Confidence intervals about score reliability coefficients, please: An EPM guidelines editorial. Educational and Psychological Measurement, 61, 517-531.

Feldt, L. S., & Ankenmann, R. D. (1999). Determining sample size for a test of the equality of alpha coefficients when the number of part-tests is small. Psychological Methods, 4, 366-377.

Fiske, D. W. (1966). Some hypotheses concerning test adequacy. Educational and Psychological Measurement, 26, 69-88.

Goldberg, L. R. (1997). A broad-bandwidth, public-domain personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7-28). Tilburg, the Netherlands: Tilburg University Press.

Green, S. B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88-101.

Guion, R. M. (2004). Banding: Background and general management purpose. In H. Aguinis (Ed.), Test-score banding in human resource selection: Technical, legal, and societal issues. Westport, CT: Praeger Publishers.

Hakstian, A. R., & Whalen, T. E. (1976). A K-sample significance test for independent alpha coefficients. Psychometrika, 41, 219-231.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139-164.

Hoyt, C. J. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.

Hunter, J. E., & Schmidt, F. L. (1994). The estimation of sampling error variance in the meta-analysis of correlations: Use of r in the homogenous case. Journal of Applied Psychology, 78, 171-177.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Newbury Park, CA: Sage.

Johnson, H. G. (1944). An empirical study of the influences of errors of measurement upon correlation. American Journal of Psychology, 57, 521-536.

Kehoe, J. F., & Tenopyr, M. L. (1994). Adjustment in assessment scores and their usage: A taxonomy and evaluation of methods. Psychological Assessment, 6, 291-303.

Knapp, T. R. (1991). Coefficient alpha: Conceptualizations and anomalies. Research in Nursing and Health, 14, 457-460.

Komaroff, E. (1997). Effect of simultaneous violations of essential τ-equivalence and uncorrelated error on coefficient α. Applied Psychological Measurement, 21, 337-348.

Komaroff, E. (1997). SEMNET discussion on alpha and correlated error. Retrieved February 10, 2006, from http://bama.ua.edu/cgi-bin/wa?A2=ind9710&L=semnet&T=0&F=&S=&P=770

Kristof, W. (1974). Estimation of reliability and true score variance from a split of a test into three arbitrary parts. Psychometrika, 39, 491-499.

Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.

Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test theory and structural equation modeling. Structural Equation Modeling, 2, 255-273.

Murphy, K. R., & Davidshofer, C. (2001). Psychological testing: Principles and applications (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1-13.

Raykov, T. (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous items. Applied Psychological Measurement, 22, 375-385.

Reeve, C. L., Heggestad, E. D., & George, E. (2005). Estimation of transient error in cognitive ability scales. International Journal of Selection and Assessment, 13, 316-320.

Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.

Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 231-244). New York: Russell Sage Foundation.

Schmidt, F. L. (1991). Why all banding procedures in personnel selection are logically flawed. Human Performance, 4, 265-277.

Schmidt, F. L., & Hunter, J. E. (1995). The fatal internal contradiction in banding: Its statistical rationale is logically inconsistent with its operational procedures. Human Performance, 8, 203-214.

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.

Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8, 206-224.

Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Methods, 1, 350-353.

Sherer, M., Maddux, J. E., Mercandante, B., Prentice-Dunn, S., Jacobs, B., & Rogers, R. W. (1982). The self-efficacy scale. Psychological Reports, 76, 707-710.

Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures. Bowling Green, OH: SIOP.

Thompson, B. (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage Publications.

Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement. Washington, DC: American Council on Education.

Uniform Guidelines on Employee Selection Procedures, 43 Fed. Reg. 38290-38315 (1978).

van Zyl, J. M., Neudecker, H., & Nel, D. G. (2000). On the distribution of the maximum likelihood estimator of Cronbach's alpha. Psychometrika, 65, 271-280.

Vautier, S., & Jmel, S. (2003). Transient error or specificity? An alternative to the staggered equivalent split-half procedure. Psychological Methods, 8, 225-238.

Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Zedeck, S., Outtz, J., Cascio, W. F., & Goldstein, I. L. (1991). Why do "testing experts" have such limited vision? Human Performance, 4, 297-308.

Zimmerman, D. W., Zumbo, B. D., & Lalonde, C. (1993). Coefficient alpha as an estimate of test reliability under violation of two assumptions. Educational and Psychological Measurement, 53, 33-49.

APPENDICES
Appendix A

Example Item Change from 1999 Administration to 2001 Administration
(* indicates the keyed response.)

1999 - When executing an arrest warrant, there are several procedures which must be carefully followed to ensure that the warrant is executed properly. For example, if the arrestee demands to see the warrant before the arrest is made, the arresting officer must:
a) show the warrant to the arrestee and allow the arrestee to examine it, before proceeding with the arrest.
b) show the warrant to the arrestee as soon as practicable, even if that time is after the arrest.*
c) explain the cause of the arrest either by stating the substance of the warrant or by reading it to the arrestee.
d) issue a copy of the warrant to the arrestee as soon as practicable, even if that time is after the arrest.

2001 - If an arrestee demands that an officer executing an arrest warrant show the warrant before the arrest is made, the arresting officer must:
a) explain the cause of the arrest by reading the warrant to the arrestee before proceeding with the arrest.
b) provide the arrestee with a copy of the warrant as soon as practicable, even if that time is after the arrest.*
c) show the warrant to the arrestee as soon as practicable, even if that time is after the arrest.
d) allow the arrestee to examine the warrant before proceeding with the arrest.

Appendix B

COVARIANCE MATRICES

[Three 14 x 14 item covariance matrices appear here in the original document: the 1999 Covariance Matrix, the 1999 & 2001 Covariance Matrix, and the 2001 Covariance Matrix. The matrix entries did not survive extraction intact; the 2001 matrix is reproduced in the Appendix C syntax below.]
Appendix C

SYNTAX USED TO CALCULATE ASE AND CONFIDENCE INTERVALS [Footnote 9]

* SPSS MATRIX syntax: computes coefficient alpha, its standard error (ASE),
* and 95% confidence limits from the 2001 item covariance matrix.
matrix.
compute numbitem = 14.
compute numbsubj = 171.
compute itemcov = {.241, -.014, .001, .001, .004, .018, -.002, -.028, -.014, -.007, .028, -.017, -.033, -.009;
 -.014, .251, .007, .025, .010, .005, .003, .025, -.002, .003, .007, .004, .009, -.006;
 .001, .007, .113, .025, .008, .013, -.001, -.007, .002, .004, .004, .010, .000, -.003;
 .001, .025, .025, .113, .008, .001, -.001, .005, -.010, -.002, .010, -.007, .012, .003;
 .004, .010, .008, .008, .108, .026, -.001, .006, .002, .004, .005, .011, .026, .009;
 .018, .005, .013, .001, .026, .195, -.002, .009, .015, -.005, .007, .020, .004, .006;
 -.002, .003, -.001, -.001, -.001, -.002, .006, -.001, -.001, .000, -.001, .005, -.002, .000;
 -.028, .025, -.007, .005, .006, .009, -.001, .153, .018, .003, -.002, .008, .044, .007;
 -.014, -.002, .002, -.010, .002, .015, -.001, .018, .108, .004, -.013, .023, .014, .003;
 -.007, .003, .004, -.002, .004, -.005, .000, .003, .004, .017, -.002, .003, .005, .000;
 .028, .007, .004, .010, .005, .007, -.001, -.002, -.013, -.002, .095, -.004, -.009, -.002;
 -.017, .004, .010, -.007, .011, .020, .005, .008, .023, .003, -.004, .126, .006, .232;
 -.033, .009, .000, .012, .026, .004, -.002, .044, .014, .005, -.009, .006, .232, .015;
 -.009, -.006, -.003, .003, .009, .006, .000, .007, .003, .000, -.002, .002, .015, .023}.
compute one = make(numbitem, 1, 1).
compute jtphij = transpos(one).
compute jtphij = jtphij * itemcov.
compute jtphij = jtphij * one.
compute trmy = trace(itemcov).
compute trmy = trmy / jtphij.
compute myalpha = 1 - trmy.
compute nn1 = numbitem - 1.
compute nn1 = numbitem / nn1.
compute myalpha = nn1 * myalpha.
compute trphisq = itemcov * itemcov.
compute trphisq = trace(trphisq).
compute trsqphi = trace(itemcov).
compute trsqphi = trsqphi**2.
compute ttp = itemcov * itemcov.
compute jtphisqj = transpos(one).
compute jtphisqj = jtphisqj * ttp.
compute jtphisqj = jtphisqj * one.
compute omega = trphisq + trsqphi.
compute omega = jtphij * omega.
compute omegab = trace(itemcov).
compute omegab = omegab * jtphisqj.
compute omega = omega - (2*omegab).
compute omega = (2/(jtphij**3)) * omega.
compute s2 = (numbitem**2) / ((numbitem-1)**2).
compute s2 = s2 * omega.
compute se = sqrt(s2/numbsubj).
compute cimin95 = myalpha - (1.96*se).
compute cimax95 = myalpha + (1.96*se).
print myalpha /format = "f8.3" /title = 'Your coefficient alpha is:'.
print cimin95 /format = "f8.3" /title = 'The lower 95% confidence limit follows:'.
print cimax95 /format = "f8.3" /title = 'The upper 95% confidence limit follows:'.
end matrix.

Footnote 9: Green, S. B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88-101. Hakstian, A. R., & Whalen, T. E. (1976). A K-sample significance test for independent alpha coefficients. Psychometrika, 41, 219-231.
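For readers who do not work in SPSS, the sketch below re-expresses the same computation in Python with NumPy. This is a convenience translation, not part of the original study; it mirrors the syntax above step by step and should reproduce the same alpha, standard error, and 95% confidence limits for any item covariance matrix. The three-item example matrix at the bottom is hypothetical.

import numpy as np

def alpha_with_ase(itemcov, n_subj, z=1.96):
    """Coefficient alpha plus ASE-based confidence limits, mirroring the
    SPSS MATRIX syntax above (after Duhachek & Iacobucci, 2004)."""
    k = itemcov.shape[0]                    # number of items
    j = np.ones(k)
    jtphij = j @ itemcov @ j                # 1'(Phi)1: sum of all covariances
    alpha = (k / (k - 1)) * (1.0 - np.trace(itemcov) / jtphij)

    trphisq = np.trace(itemcov @ itemcov)   # tr(Phi^2)
    trsqphi = np.trace(itemcov) ** 2        # tr(Phi)^2
    jtphisqj = j @ (itemcov @ itemcov) @ j  # 1'(Phi^2)1
    omega = (2.0 / jtphij**3) * (jtphij * (trphisq + trsqphi)
                                 - 2.0 * np.trace(itemcov) * jtphisqj)
    se = np.sqrt((k**2 / (k - 1) ** 2) * omega / n_subj)
    return alpha, se, alpha - z * se, alpha + z * se

# Example call with a small illustrative 3-item matrix (hypothetical values):
phi = np.array([[0.25, 0.10, 0.12],
                [0.10, 0.30, 0.11],
                [0.12, 0.11, 0.28]])
a, se, lo, hi = alpha_with_ase(phi, n_subj=171)
print("alpha = %.3f, SE = %.3f, 95%% CI = [%.3f, %.3f]" % (a, se, lo, hi))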
Appendix D

Candidates' Race and Rank

Rank 1 (13 candidates): White: 7, 13, 35, 42, 59, 72, 100, 103, 120, 123, 142, 143; African American: 108
Rank 14 (38 candidates): White: 11, 12, 20, 22, 30, 36, 38, 49, 53, 71, 78, 79, 81, 88, 93, 104, 109, 110, 113, 114, 122, 124, 127, 133, 134, 144, 146, 159, 162, 167; African American: 18, 66, 90, 118, 132, 135, 154, 168
Rank 15 (1 candidate): White: 6
Rank 53 (43 candidates): White: 10, 14, 19, 25, 33, 39, 40, 41, 51, 54, 57, 58, 60, 62, 67, 69, 70, 74, 83, 96, 98, 105, 131, 137, 139, 161, 164, 166; African American: 17, 23, 24, 26, 86, 89, 101, 107, 112, 126, 128, 147, 156, 160, 170
Rank 96 (32 candidates): White: 15, 16, 27, 43, 48, 55, 68, 87, 106, 116, 117, 121, 125, 145, 171; African American: 3, 9, 21, 44, 45, 46, 52, 61, 76, 80, 84, 95, 136, 138, 148, 150, 158
Rank 128 (23 candidates): White: 8, 29, 31, 32, 37, 75, 77, 94, 102, 111; African American: 4, 47, 50, 63, 73, 82, 91, 99, 129, 140, 149, 163, 165
Rank 151 (12 candidates): White: 65, 115, 119, 153, 157; African American: 2, 5, 34, 56, 64, 97, 141
Rank 163 (8 candidates): White: 1, 130; African American: 28, 85, 92, 152, 155, 169
Rank 171 (1 candidate): African American: 151
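Data of the kind tabled in Appendix D are routinely screened for adverse impact under the four-fifths rule of the Uniform Guidelines (1978). The sketch below shows that screening in Python; the selection and applicant counts are hypothetical and are not computed from the appendix.

# Hypothetical selection counts; not taken from Appendix D.
selected = {"White": 12, "African American": 5}
applicants = {"White": 100, "African American": 71}

# Selection rate per group, compared against the highest group's rate.
rates = {g: selected[g] / applicants[g] for g in applicants}
highest = max(rates.values())

for group, rate in rates.items():
    ratio = rate / highest
    flag = "potential adverse impact" if ratio < 0.80 else "passes four-fifths rule"
    print("%s: selection rate = %.3f, impact ratio = %.2f (%s)" % (group, rate, ratio, flag))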
Table 1

Racial differences using different selection techniques

Selection Technique          d score (White-Black)   d score (White-Hispanic)   Meta-Analysis
Cognitive Ability            1.10                    .72                        Roth, BeVier, Switzer, & Tyler (2001)
GPA                          .78                     N/A                        Roth & Bobko (2000)
Job Sample/Job Knowledge     .38                     .00                        Schmitt, Clause, & Pulakos (1999)
Biodata                      .33                     N/A                        Bobko, Roth, & Potosky (1999)
Structured Interview         .23                     N/A                        Huffcutt & Roth (1998)

Note. d score represents the difference between standardized population means. Source: Aamodt (2004).

Table 2

Hypothetical Score Distribution and Test Score Use (from page X)

Test Score   Race               Selection Possibility - Top Down   Selection Possibility - Banding
95           Caucasian          X                                  X
94           Caucasian          X                                  X
94           Caucasian          X                                  X
92           Caucasian          X                                  X
91           Caucasian          X                                  X
89           African American                                      X
89           Caucasian                                             X
89           Caucasian                                             X
87           African American                                      X
87           African American                                      X
86           African American                                      X
86           Caucasian                                             X
86           African American                                      X
85           Caucasian
85           African American
83           Caucasian
82           Caucasian
82           African American
81           Caucasian
80           Caucasian
80           African American
80           Caucasian
80           Caucasian
79           African American
79           Caucasian

Percentages: All candidates listed: 64% Caucasian, 36% African American. Top Down: 100% Caucasian, 0% African American. Banding: 62% Caucasian, 38% African American.

Table 5

KSA and Assigned Testing Modalities for Sergeant Selection Procedures

Assessed modalities: Written (M.C.) Test, In-Basket/Work Sample, Oral Examination

Technical Knowledge: one assigned modality
Oral Expression: one assigned modality
Written Expression: one assigned modality
Interpersonal Relations: one assigned modality
Information Analysis: two assigned modalities
Judgment & Decision Making: two assigned modalities
Planning & Organizing: two assigned modalities
Resource Management: two assigned modalities

[The checkmark-to-modality assignments in the original table could not be recovered.]

Figure 1

Item covariance matrix for test-retest data (from Green, 2003)