Differential Functioning by High and Low Impression Management Groups on a Big Five Applicant Screening Tool

by

Brennan Daniel Cox

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
May 14, 2010

Keywords: faking, impression management, personality, selection, measurement equivalence, differential functioning

Copyright 2010 by Brennan Daniel Cox

Approved by

Adrian Thomas, Chair, Associate Professor of Psychology
Daniel Svyantek, Associate Professor of Psychology
Jacqueline Mitchelson, Assistant Professor of Psychology
William Buskist, Distinguished Professor in the Teaching of Psychology

Abstract

The degree to which applicant personality test faking constitutes a real world threat is a topic of considerable debate among industrial and organizational psychologists. Researchers have investigated the faking problem using a variety of methodologies but have found inconclusive results. One method for studying faking involves the use of impression management scales, which are designed to detect individuals' use of intentional response distortion. However, most scales designed to detect applicant faking are too lengthy, too general, or otherwise impractical for use in applied settings. The current applied research involved the development, implementation, and validation of an eight-item impression management scale for use with the Fitability 5a, a Big Five personality test used for screening job applicants. Applicants' (n = 21,017) scores on the new scale demonstrated satisfactory reliability and correlated as one might expect with the five personality scales. Applicants considered to be "fakers" produced meaningful score differences on the agreeableness, conscientiousness, neuroticism, and openness scales, but not the extraversion scale. Additional tests for measurement equivalence were performed using the item response theory-based differential functioning of items and tests framework developed by Raju, van der Linden, and Fleer (1995). Most personality items (35 of 55) demonstrated differential item functioning (DIF); only items on the extraversion scale did not exhibit significant DIF. Significant differential test functioning (DTF) was found for each of the scales that contained DIF items. Correcting for DTF by eliminating items with significant DIF was impossible, as DIF was uniform across all items: the high impression management group demonstrated a higher probability of responding positively to the items (or negatively, for neuroticism) than the low impression management group. These findings suggest that applicant faking is a real world threat to the Fitability 5a, because impression management strongly affected the construct validity of the personality measure.

Acknowledgments

I would like to thank Adrian Thomas for his guidance in developing this study and preparing the manuscript. Without his patience, none of this project would have been possible. I would also like to thank Jackie Mitchelson for her statistical assistance, as well as Dan Svyantek, Bill Buskist, and Alan Walker for their reviews and feedback. They helped make this project a rewarding learning opportunity. To my wife Asha, my daughter Annabella, my parents, family, and friends: Thank you for your love and support. You gave me air as I suffocated myself in this project, and I could not have done it without you.
Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Literature Review
  A Brief History of Personality Testing for Selection
    The Big Five
    Personality Testing for Personnel Selection
  The Faking Problem
    Can personality tests be faked?
    Do applicants fake?
    Is faking a problem in the real world?
    Summary of the faking problem
  Measurement Equivalence and the DFIT Framework
    Measurement equivalence
    The DFIT procedure
    DFIT and the faking problem
  The Current Research
Chapter 3. Study 1: Development of the Fitability 5a Impression Management Scale
  Method
    Item development
    Participants and procedure
    Sample A participants and procedure
    Sample A measures
    Sample A analyses and results
    Sample B participants and procedure
    Sample B measures
    Sample B analyses and results
  Discussion
Chapter 4. Study 2: Is Faking a Problem for the Fitability 5a?
  Method
    Participants
    Measures
    Procedure and analyses
  Results
    DFIT on the Fitability 5a
    Agreeableness
    Conscientiousness
    Extraversion
    Neuroticism
    Openness
    DFIT on the impression management scale
  Discussion
Chapter 5. General Discussion
  Contribution 1: The Fitability 5a Impression Management Scale
  Contribution 2: Faking Matters in the Real World
  Limitations and Future Directions
  Implications
References
Appendices
  A. Figures
  B. Tables
  C. Measures

List of Figures

1. Item response function (item characteristic curve)
2. Category response function for a five-category item
3. Scree plot and eigenvalues for impression management scale development

List of Tables

1. Validity of selection tests commonly used for predicting overall job performance
2. DFIT research on personality test faking and measurement equivalence
3. Items, factor loadings, and internal consistency estimates for the four-factor solution
4. Correlations among the four factors and the BIDR scales
5. Impression management item and scale means and standard deviations
6. Scale means and standard deviations for the total sample and high/low impression management subgroups
7. Correlations among the variables for the total sample
8. Correlations among the variables for the high impression management group
9. Correlations among the variables for the low impression management group
10. Agreeableness item means and NCDIF, CDIF, and DTF values for fakers versus non-fakers
11. Conscientiousness item means and NCDIF, CDIF, and DTF values for fakers versus non-fakers
12. Extraversion item means and NCDIF values for fakers versus non-fakers
13. Neuroticism item means and NCDIF, CDIF, and DTF values for fakers versus non-fakers
14. Openness item means and NCDIF, CDIF, and DTF values for fakers versus non-fakers
15. Impression management item means and standard deviations by gender and race

Chapter 1

Introduction

Personality tests are becoming increasingly popular for use in employment selection (Rothstein & Goffin, 2006; Viswesvaran, Deller, & Ones, 2007). This growth is due largely to widespread acceptance of the five-factor model for organizing and describing personality constructs, as well as to evidence that personality measures predict job-relevant criteria without demonstrating adverse impact (Barrick & Mount, 2005). Despite these encouraging developments, most personality assessments used in selection are self-report and may therefore be susceptible to impression management (i.e., faking good). The extent to which applicant faking actually constitutes a real world problem is currently a topic of considerable debate in industrial and organizational (I/O) psychology (e.g., Morgeson et al., 2007; Ones, Dilchert, Viswesvaran, & Judge, 2007; Tett & Christiansen, 2007).

In 2004, a panel of current and former editors of Personnel Psychology and the Journal of Applied Psychology gathered at the annual Society for I/O Psychology conference to discuss the issue of personality test faking. In an effort to reach closure, these experts concluded that (a) applicants can and do fake on personality tests, (b) faking can affect criterion-related validity, (c) efforts to correct for faking do not resolve this problem completely, and (d) for some jobs, presenting a false but favorable impression may actually be a desirable applicant characteristic (Morgeson et al., 2007). Tett and Christiansen (2007) added to these conclusions that (a) applicants tend to fake to different degrees, (b) individual differences in faking can upset the rank order of applicants, and (c) faking attenuates, but may not necessarily destroy, personality test validity. Ones et al.
(2007) responded to these declarations with expressed concern that some of the panel's conclusions may be unwarranted because of the different methodologies used in the studies the panel reviewed. To date, most researchers have examined the issue of personality test faking using traditional methodologies, such as comparing mean scores of laboratory participants directed to respond honestly and then fake good, or by conducting meta-analyses comparing applicants to non-applicants (Ones et al., 2007). The results of these studies are largely mixed, with some concluding that faking is a problem and others concluding that it is not. The debate over faking creates a need for additional research that examines the faking problem using methodologies other than those typically employed in faking studies, as well as research investigating the faking phenomenon as it occurs in real world settings with real world job applicants.

Most research investigating the consequences of faking examines whether faking disrupts the criterion-related validity of personality tests. One lesser-used approach that may inform the debate involves testing whether faking disrupts the measurement properties of personality tests, such as their construct validity. It would be considerably problematic, for instance, if personality tests (or test items) were found to function differently for individuals who fake than for those who do not fake (or who fake to a lesser degree). This form of measurement bias concerns the concept of measurement equivalence and holds profound implications for organizations that use the results of personality tests for making employment-based decisions.

Measurement equivalence occurs when "the relations between observed scores and latent constructs are identical across relevant groups" (Drasgow & Kanfer, 1985, p. 662). In the presence of measurement equivalence, the items on a personality test should be equally accurate for individuals who fake and for those who do not fake. Thus, two sets of applicants who have the same standing on a latent personality construct (e.g., conscientiousness) should score identically on a test of this construct, regardless of their respective differences in faking. It should not matter if one set of applicants fakes good and the other does not: the relationship between their standing on the construct and their observed scores should remain identical. Alternatively, in the absence of measurement equivalence, a test could potentially favor members of one group over the other, in that two sets of applicants with identical levels of conscientiousness would respond differently to the items on a conscientiousness measure. Applicants engaging in impression management, for instance, might respond more favorably to desirable items than applicants not engaging in impression management, despite having equal standing on the latent construct. The key issue with measurement equivalence is not whether faking results in inflated scores on personality tests (though this outcome is likely), but whether faking affects how respondents interpret the test or test items. Such a scenario would compromise the validity of the personality measure and make the interpretation and use of scores for selection decisions impossible. Demonstrating measurement equivalence between fakers and non-fakers is therefore an issue of high practical importance for organizations that use personality tests for selection.
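To make the idea concrete, consider a toy simulation, not any particular model or the Fitability 5a data. Both groups share identical latent trait levels, but the score-generating relation itself shifts for the high impression management group (the 0.5-point intercept shift is invented for illustration), so equal-theta applicants no longer have equal expected scores:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=10_000)   # identical latent standing in both groups

# Equivalence: one and the same score-generating relation for everyone.
honest = 3.0 + 0.8 * theta + rng.normal(0.0, 0.3, theta.size)

# Inequivalence: the relation itself shifts for the high impression
# management group (hypothetical 0.5 intercept shift), so applicants with
# equal theta no longer share the same expected observed score.
faking = 3.5 + 0.8 * theta + rng.normal(0.0, 0.3, theta.size)

print(round(float(faking.mean() - honest.mean()), 2))  # ~0.50 despite equal theta
```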
Researchers can test for measurement equivalence at the item level or across an entire scale or test by conducting analyses of differential item functioning (DIF) or differential test functioning (DTF), respectively. Although numerous procedures are available for assessing DIF and DTF, the differential functioning of items and tests (DFIT) procedure developed by Raju, van der Linden, and Fleer (1995) has proven particularly useful for organizational researchers. The DFIT procedure not only identifies items with significant DIF, but also determines the effects of eliminating such items on the overall functioning of the test. Thus, psychometricians can use the DFIT procedure to evaluate and potentially correct for differential responding by members of different groups, thereby increasing measurement equivalence. In addition, because the DFIT procedure works with dichotomous as well as polytomous models, it applies to most measures used in employment contexts, including personality tests.

Previous researchers have used the DFIT procedure to examine the measurement equivalence of personality tests used for selection across groups of fakers and non-fakers. Flanagan and Raju (1997) applied this technique to the extraversion scale of the 16-PF, and Henry and Raju (2006) used the DFIT procedure to examine measurement equivalence on an empirically derived conscientiousness scale of the California Psychological Inventory (CPI). In both studies, the researchers evaluated item-level and scale-level scores on the personality measures for differential functioning by comparing high and low/average scorers on the impression management scales included with these measures (used to represent fakers and non-fakers, respectively). With the exception of a few minor differences, they found that the measures functioned in the same manner for each group. Thus, both studies concluded that faking, as measured with impression management scales, might not be a significant problem for personality tests used for selection.

The DFIT studies by Flanagan and Raju (1997) and Henry and Raju (2006) were limited, however, in three key ways. First, both studies examined only one personality dimension; as such, the conclusions reached apply only to the scales used in these studies. It is possible that applicant faking could affect the validity of alternative personality tests that contain additional or different scales. Second, neither the 16-PF nor the CPI measures the five-factor model of personality, the most commonly accepted model of personality. Henry and Raju even had to derive their conscientiousness scale empirically from items intended for other CPI scales in order to assess this Big Five construct. It is possible that a personality test designed to assess the Big Five factors directly could produce different results. Third, the response scale for the 16-PF uses a three-level forced-choice format, and the response scale for the CPI uses a dichotomous true/false format. Individuals who take these tests are therefore restricted to a narrow set of response options, which could produce a restriction of range in their overall scores. It is possible that a personality test using a response scale with more than three options could influence the degree to which applicants fake.
One goal of the current research was to assess these possibilities by applying the DFIT framework to investigate the impact of real world applicant faking on all five scales of a true Big Five measure of personality, one that uses a polytomous, five-category response format. In this dissertation, the measure of interest was the Fitability 5a.

The Fitability 5a is a Big Five personality test "specifically designed for job applicant populations" (Lucius, 2003, p. 40). Thousands of job candidates, including applicants to Fortune 500 companies, complete this measure each month for employment screening purposes. Observations since the introduction of the Fitability 5a, however, indicate that applicants and non-applicants score differently on this measure. One potential explanation for these score differences is that some applicants might be faking on the Fitability 5a in order to increase their desirability to the hiring organization. Although many personality tests (e.g., the 16-PF and CPI) have custom scales that detect applicant faking, no such measurement device existed for the Fitability 5a. Therefore, in Study 1, a scale was developed for assessing impression management on the Fitability 5a. Next, in order to investigate the influence of faking on the Fitability 5a, DIF and DTF analyses were performed in Study 2 across high and low impression management groups to evaluate whether faking affects the measurement equivalence of the Fitability 5a's scales.

Chapter 2

Literature Review

A Brief History of Personality Testing for Selection

Organizational researchers have investigated the use of personality tests in employment contexts for over 100 years. In reviewing this body of work, Barrick, Mount, and Judge (2001) outlined two distinct phases. The first phase, which lasted from the early 1900s to the mid-1980s, was largely pessimistic and is often summarized using Guion and Gottier's (1965) cautionary conclusion: "It is difficult ... to advocate with a clear conscience, the use of personality measures in most situations as a basis for making employment decisions about people" (p. 160). This sentiment was justified for several reasons, including the fact that researchers of this period lacked a proper system for managing the vast complexity of personality traits used to describe people (Barrick et al., 2001). The past 30 years, however, have seen a surge of support for the use of personality measures in the workplace. The second phase of Barrick et al.'s (2001) history, which continues today, developed from years of converging evidence in support of the unifying five-factor model for classifying personality traits.

The Big Five. Early efforts to produce a definitive taxonomy for organizing personality attributes began with Galton's (1884) lexical hypothesis, which proposed that a complete catalog of all personality traits could come from sampling the vocabulary (i.e., lexicon) people use to describe each other. In applying Galton's theory, Allport and Odbert (1936) performed an exhaustive dictionary search that produced a list of nearly 18,000 personality-type adjectives, which they synthesized into a more manageable list of 4,500 distinct personality traits. Cattell (1957) later condensed this list to 171 terms and eventually, via factor analysis, derived 16 comprehensive factors that he considered fundamental to describing normal personality (Cattell, Eber, & Tatsuoka, 1970).
Several efforts to replicate Cattell's work, however, were largely unsuccessful, as many researchers were unable to derive more than five general factors of personality (e.g., Borgatta, 1964; Fiske, 1949; Norman, 1963; Smith, 1967; Tupes & Christal, 1961). At the time, the leading personality theorists overlooked these findings in favor of more established theories. Indeed, it was not until Goldberg's 1981 lexical analysis that the five-factor model, or the "Big Five," gained popular applied acceptance (Digman, 1990).

The Big Five personality factors are agreeableness, conscientiousness, extraversion, neuroticism, and openness to experience. Each of these major traits subsumes a larger number of more specific traits, which often appear as contrasting adjectives used to characterize high and low scorers on Big Five measures (McCrae & Costa, 1987). Agreeableness represents several opposing trait comparisons, including the degree to which a person is good-natured versus irritable, courteous versus rude, lenient versus critical, flexible versus stubborn, and sympathetic versus callous. Traits associated with conscientiousness include dependable, hardworking, and organized, versus careless, unreliable, and lazy. Extraversion concerns one's interpersonal style, with high scorers (i.e., extraverts) tending to be sociable, energetic, and assertive, while low scorers (i.e., introverts) tend to be more reserved, lonesome, and quiet. The neuroticism factor (or, conversely, emotional stability) represents a person's tendencies toward negative emotions, such as whether one is generally calm, relaxed, and hardy, rather than worrying, nervous, and vulnerable. Lastly, the openness factor characterizes individuals as intellectually curious, creative, and daring, versus conventional, cautious, and straightforward.

The introduction of the Big Five marked the beginning of what Barrick et al. (2001) called the renaissance of personality testing. As these authors discussed, the majority of studies conducted since the mid-1980s have used some variant of the five-factor model to conceptualize personality. This model generalizes across cultures and rating formats (e.g., self, peer, observer), and evidence suggests that these traits are heritable and stable over time (Costa & McCrae, 1992). Although the Big Five should not be considered the end-all model of personality (McAdams, 1992), there is a consensus among trait theorists that these five factors can be used to describe all human personality traits effectively (Cervone & Pervin, 2007). The introduction of the five-factor model provided hope in light of Guion and Gottier's (1965) early warnings: once researchers and practitioners had an agreed-upon system for categorizing traits, they seemed less hesitant to use personality tests for employment purposes.

Personality testing for personnel selection. Perhaps the most compelling argument in support of the use of personality tests for employment decision-making came in the early 1990s, when separate meta-analyses conducted by Barrick and Mount (1991) and Tett, Jackson, and Rothstein (1991) concluded that organizations can use Big Five measures to predict job-relevant performance criteria. In their study, Barrick and Mount examined the Big Five in relation to job proficiency, training proficiency, and personnel data (e.g., salary level, turnover, status change, tenure) for various occupational groups.
Of all the Big Five dimensions, conscientiousness demonstrated the most consistent relationships across all criterion types and occupational groups. For this reason, Barrick and Mount recommended that conscientiousness be considered the primary personality variable of interest for organizational researchers and personality test practitioners (Mount & Barrick, 1995). In addition to their evidence promoting the use of conscientiousness for workplace research, Barrick and Mount (1991) also determined that each of the remaining Big Five factors was a valid predictor for at least one criterion variable and at least one occupational group. Extraversion, for example, predicted all three performance criteria for individuals employed in both sales and management positions. Thus, for these two occupational groups, being outgoing, sociable, and assertive (i.e., extraverted), as opposed to inactive, quiet, and reserved (i.e., introverted), was associated with better performance on the job, in training, and across personnel data (e.g., higher pay, less turnover). In addition, scores on the openness factor predicted training proficiency for all job categories, including sales, management, police, skilled/semi-skilled, and professional occupations. This finding suggested that individuals who are intellectually curious and willing to change (i.e., open to experience) tend to perform better in training than individuals who are suspicious or narrow-minded, regardless of occupation.

In the same year as Barrick and Mount's (1991) influential meta-analysis, Tett et al. also concluded that the Big Five factors contribute to the prediction of job performance. These researchers extended Barrick and Mount's findings by calculating the average validity of the Big Five factors taken together and by examining whether organizations' use of a job analysis to determine their choice of personality test moderated validity. Their analyses produced a corrected mean scale validity of .38 for studies that relied on a job analysis to select the appropriate personality test, compared to .29 for studies that did not. Thus, Tett et al. concluded that the criterion-related validity of personality tests used in selection is higher when there is a conceptual link between the test and the position under study, as determined through a job analysis. Taken together, the meta-analyses by Barrick and Mount and Tett et al. provided sufficient empirical support to counter Guion and Gottier's (1965) early warnings. As Tett et al. stated in their concluding remarks: "Personality measures have a place in personnel selection" (p. 732).

With renewed confidence in the practical validity of personality tests, organizational researchers have conducted thousands of studies to determine the overall value of these measures in job-related contexts. In a prototypic example, Schmidt and Hunter (1998) compared the predictive validity of conscientiousness scales against 18 other selection procedures. Their results (see Table 1) indicated that conscientiousness measures produced an average validity coefficient of .31 for predicting overall job performance, compared to .51 for tests of general mental ability (g).
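These two coefficients combine in an instructive way. As a worked sketch using the standard two-predictor multiple correlation formula, and assuming for illustration that conscientiousness is roughly uncorrelated with g (consistent with the Ackerman and Heggestad, 1997, finding cited below), a composite of the two predictors outperforms g alone:

```python
from math import sqrt

def multiple_r(r1, r2, r12):
    # Multiple correlation of a criterion with the optimal composite of two
    # predictors, given their validities r1, r2 and intercorrelation r12.
    return sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

# .51 for g and .31 for conscientiousness (Schmidt & Hunter, 1998); the
# zero predictor intercorrelation is an assumption made for illustration.
print(round(multiple_r(0.51, 0.31, 0.0), 2))  # ~0.60, a clear gain over g alone
```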
Although tests of g produced the higher estimate, Schmidt and Hunter noted that tests of g do not explain the total variability in job performance and have a history of producing adverse impact, or unintentional discrimination against members of protected groups (e.g., racial minorities). For this reason, organizations are encouraged to supplement their selection batteries with measures that are not g-loaded, such as personality tests (Gatewood & Feild, 2001).

Organizations rely on personality tests for employment selection because personality tests reliably predict job outcomes (e.g., Barrick & Mount, 1991; Tett et al., 1991; Hurtz & Donovan, 2000) without demonstrating adverse impact (Hogan & Holland, 2003; Hogan, 2005). Because personality test scores do not correlate with scores on g-based tests (Ackerman & Heggestad, 1997), and even explain additional variance in employee performance beyond the variance accounted for by g (Schmidt & Hunter, 1998), they can add value to most selection test batteries currently in use. That is, organizations can use personality measures in conjunction with other measures to increase the prediction of job performance as well as to help reduce adverse impact resulting from their selection batteries (Hough, Oswald, & Ployhart, 2001).

In sum, personality tests have a place in selection research as well as practice. Measures of the Big Five are particularly useful for four main reasons. First, they assess a wide range of personality traits (Cervone & Pervin, 2007). Second, they correlate with a wide range of job-related performance variables (Ones et al., 2007). Third, they apply to a wide range of occupations (Barrick & Mount, 1991). Fourth, Big Five personality test scales are unlikely to produce adverse impact in employment selection (Foldes, Duehr, & Ones, 2008). Therefore, although the history of workplace personality testing has had its share of criticism (see Barrick et al., 2001), it seems worthwhile to pursue the practice of personality testing for employment selection.

The Faking Problem

Despite major advances in work-related personality test theory, research, and application over the last 100 years, one considerable threat to the validity of these instruments remains. Most personality tests used by organizations are self-report, and therefore may be susceptible to biased responding. Unlike g-loaded or job knowledge tests, which consist of items with one correct answer choice and which require a particular ability level to answer correctly on a consistent basis, most personality tests contain self-report items with response choices that vary in desirability depending on the testing context. Given the high-stakes nature of most selection contexts, some applicants may be inclined to distort their responses to personality tests in order to appear more attractive to the hiring organization. This possibility represents the faking problem, the next hurdle for personality researchers to overcome.

Applicant faking constitutes "a conscious effort to manipulate responses to personality items to make a positive impression" (Zickar & Robie, 1999, p. 551). The assumption behind the faking problem is that some applicants may be able to manipulate their responses to personality tests in such a way as to appear more attractive to the hiring organization and thereby increase their chances of gaining employment over more honest and potentially more qualified applicants.
Any faulty hiring decisions that result from applicant faking have the potential to affect the organization negatively, as it is unlikely that the selected "fakers" will be able to uphold their false impressions forever. Given the potential implications of the faking problem, it is no surprise that faking research has received a surge of popularity in the past several years (Griffith & McDaniel, 2006). Most faking research attempts to answer three general questions: Can personality tests be faked? Do applicants fake? Is faking a problem in the real world? Researchers have investigated these questions using a variety of methodologies and have produced mixed and therefore inconclusive results (e.g., Birkeland, Manson, Kisamore, Brannick, & Smith, 2006; Hogan, Barrett, & Hogan, 2007; McFarland & Ryan, 2000; Morgeson et al., 2007; Ones et al., 2007; Tett et al., 2007). A review of this literature demonstrates a need for additional research investigating the faking problem.

Can personality tests be faked? Most personality tests used by organizations are self-report. Applicants therefore have the potential to inflate their responses to these measures in order to appear more desirable to the hiring organization. To assess the degree to which examinees can fake on a personality test, researchers typically conduct lab studies using some form of direction manipulation. For example, McFarland and Ryan (2000) instructed student participants to respond to the items on the NEO Five Factor Inventory (Costa & McCrae, 1989) under two experimental conditions. In one condition, they instructed students to answer the items "as honestly as possible"; in the other condition, they instructed students to answer "in such a way as to make [them] look as good an applicant as possible for a job [they] would want" (p. 815). They demonstrated that examinees are capable of producing statistically significant score changes on all five scales of the personality test. The effect sizes (Cohen's d) for these changes were 1.82 for conscientiousness, 1.66 for neuroticism, 1.06 for agreeableness, 0.98 for extraversion, and 0.19 for openness. Thus, with the exception of the openness to experience factor, student participants produced significant and meaningful changes in their personality test scores when instructed to fake good.

Winkelspecht, Lewis, and Thomas (2006) conducted a similar lab-based directed faking study. These researchers found that when they instructed participants to fake good, participants not only produced higher scores but also rose to the top of the score distribution. Thus, they concluded that, in a top-down selection scenario, individuals who fake good on personality tests are capable of improving their chances of obtaining employment over more qualified, honest respondents.

Lab studies using faking instructions are useful for demonstrating the extent to which examinees can fake on a personality test, as well as the potential consequences of faking for personality test validity. However, lab studies are limited in that students instructed to fake good may not represent true applicant behavior. In lab studies, there are no negative consequences for faking. Therefore, when participants in these studies receive instructions to fake good or present the most favorable impression, they are likely to maximize their faking efforts.
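The effect sizes reported in these directed-faking studies are standardized mean differences. A minimal sketch of the computation with a pooled standard deviation, using made-up honest and fake-good scale scores rather than any study's actual data:

```python
import numpy as np

def cohens_d(x, y):
    # Standardized mean difference using the pooled standard deviation.
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

rng = np.random.default_rng(1)
honest = rng.normal(3.4, 0.6, 200)      # invented honest-condition scale scores
fake_good = rng.normal(4.3, 0.5, 200)   # invented fake-good condition scores
print(round(float(cohens_d(fake_good, honest)), 2))  # a large positive d
```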
Individuals in a true selection setting are unlikely to fake this maximally, given the high-stakes nature of the selection context and the potential repercussions of being caught faking. Of course, to determine applicant behavior, there is no better source of data than actual job applicants.

Do applicants fake? To investigate whether actual job applicants fake, researchers tend to use one of three general methodologies: (a) between-subject designs comparing applicants to non-applicants, (b) within-subject designs comparing individuals in applicant and non-applicant conditions, and (c) research designs that use scales designed to detect applicant faking. Each of these methodologies contributes to the personality test faking literature in different ways, as discussed in the following examples.

One method for determining whether (and to what extent) actual job applicants fake on personality tests involves making score comparisons between samples of applicants and non-applicants (e.g., students, incumbents). The rationale behind this between-subject methodology is that if applicants score differently than non-applicants, then these score differences may be attributable to applicant faking. Birkeland et al. (2006) conducted a meta-analysis of studies using this approach and discovered that, across all job types, applicants scored significantly higher than non-applicants on measures of conscientiousness (d = .45), emotional stability (d = .44), openness (d = .13), and extraversion (d = .11). A comparison of these effect sizes to those obtained in McFarland and Ryan's (2000) direction manipulation study suggests that actual job applicants do not inflate their scores to the same magnitude as students instructed to fake good. Nevertheless, Birkeland et al.'s study provided evidence that actual job applicants tend to score differently than non-applicants, presumably because some members of the applicant group fake good. This study represents the general findings of between-subject studies comparing applicants to non-applicants.

Between-subject studies comparing applicants and non-applicants are limited, however, in that score differences between the applicant and non-applicant groups could arguably be due to true group differences or other factors beyond faking effects. Therefore, some faking researchers have adopted within-subject methodologies to compare scores from the same individuals within applicant and non-applicant conditions. As an example of the within-subject methodology, Ellingson, Sackett, and Connelly (2007) compared individuals' scores on the California Psychological Inventory (CPI; Gough & Bradley, 1996) based on one of four naturally occurring test-retest conditions (i.e., selection-development, selection-selection, development-selection, or development-development). Individuals' scores from the selection context represented the faking condition, while their scores from the development context represented the non-faking condition. Thus, any score differences between the selection and development conditions
In another within-subjects study, Griffith, Chmielowski, and Yoshita (2007) examined score changes on a conscientiousness scale for individuals who took the measure first as job applicants and later for research purposes. In the research condition, participants received instructions first to respond honestly and then to fake good. Consistent with previous direction manipulation studies, participants significantly inflated their scores on the conscientiousness scale when instructed to fake good, such that their scores on this measure were uncorrelated with as well as significantly more positive than scores from the respond honestly and applicant conditions. Mean scores from the applicant and respond honestly conditions, however, were significantly correlated with one another (r = 0.50, p < .001, d = 0.61), which suggested that most applicants were either not faking or were not faking to a great extent. On the surface, the within-subject studies by Ellingson et al. (2007) and Griffith et al. (2007) appear to suggest that, although individuals are capable of faking on personality tests, they do not tend to do so to a troublesome degree in applied contexts. However, there are a few standout characteristics of these studies that might suggest otherwise. Ellingson et al. (2007) compared individuals in selection and developmental conditions and concluded that faking is not a problem on the CPI, but their study is limited in two key ways. First, unlike most personality measures that use polytomous 18 response scales (e.g., Likert-type scales with 5-7 response options), the CPI features a dichotomous true-false response scale. As such, test-takers are limited in the degree to which they can distort their responses on the CPI without completely misrepresenting their true personalities. Whereas polytomous response scales allow users to inflate their responses to a small degree (e.g., from moderately agree to strongly agree), users of the CPI must transition from one extreme of the scale to the other (i.e., from true to false) in order to fake their responses. A second limitation to this study involves the data analysis. Specifically, these researchers averaged scores on all 20 of the CPI?s scales together before correcting for test-retest time delay and feedback effects. After making score corrections and averaging the 20 CPI scale scores together, these researchers found minimal changes in scores. However, prior to correction, the test-retest effect sizes for the individual CPI scales were as high as 0.64, which suggests that at least some scales exhibited substantial score changes. Upon further consideration, Ellingson et al.?s conclusions are limited in that they may only apply to dichotomous personality tests, which constitute only a small percentage of personality tests used by organizations, and potentially unfounded because they computed score differences across the CPI?s scales, rather than within the scales individually. Additional characteristics warrant identification in the Griffith et al. (2007) study. These researchers focused their within-subjects study on a single measure of conscientiousness and determined that, although individuals? scores from applicant and respond honestly conditions correlated significantly with one another, they did not correlate significantly with scores from the faking condition. However, by comparing the rank order of the top 10 applicants across conditions, Griffith et al. 
demonstrated that many of their participants would have stood less of a chance of being hired had selection decisions been based on their respond-honestly scores. Indeed, six of the top 10 scorers in the applicant condition exhibited score changes of more than half a standard deviation when asked to respond honestly, with an average effect size of 0.71. Thus, although applicant scores correlated significantly with responses from the honest condition, the two conditions resulted in vastly different rank orderings of participants' conscientiousness scores. If the organization in this study had made its selection decisions using a top-down strategy, there is a good chance that a different group of individuals would have gained employment had the applicants provided more honest responses than those they actually provided.

As a group, the within-subjects studies of applicant personality test faking provide mixed results. Although they are methodologically more powerful than between-subjects designs, because they eliminate any true group differences, it appears that differences in measurement and analytic procedures produce different findings. As an alternative research design, some researchers have chosen not to define faking using score changes between response conditions, but have instead operationalized faking using scales designed to assess socially desirable response patterns. The assumption behind this alternative methodology is that some measures of social desirability may serve to identify applicants who intentionally manipulate their responses to personality test items, or otherwise "fake good" on these tests.

Traditionally, researchers have conceptualized social desirability as a unidimensional construct that describes individuals' tendencies to present themselves favorably on self-report items (e.g., Crowne & Marlowe, 1960; Ones, Viswesvaran, & Reiss, 1996). However, Paulhus (1984) recognized that there are actually two distinct dimensions of social desirability that differ based on intention. The first dimension is self-deception enhancement, which Paulhus described as an unintentional or natural tendency to consider oneself favorably, but falsely. With self-deception, individuals truly believe their positive self-enhancements, regardless of the accuracy of these beliefs. In contrast, the second form of social desirability, termed impression management, represents a deliberate misrepresentation of oneself. This dimension more closely relates to the concept of faking because it involves purposeful response distortion. Indeed, these concepts are so similar that researchers often use the terms faking and impression management interchangeably in the personality testing literature (e.g., Hogan et al., 2007; Mueller-Hanson, Heggestad, & Thornton, 2006).

To assess the dual conceptualization of social desirability, Paulhus (1984, 1991, 1998) constructed the Balanced Inventory of Desirable Responding (BIDR), a 40-item inventory containing two subscales, one for self-deception enhancement and one for impression management. According to Paulhus (1984), researchers can use the BIDR in conjunction with self-report personality tests to control for the effects of dishonest responding (i.e., impression management scores). Li and Bagger's (2006) meta-analysis of studies using the BIDR and Big Five personality measures found that applicant scores on the impression management scale correlated most highly with conscientiousness (ρ = .42, SD = .11) and agreeableness (ρ = .42, SD = .11), the two personality dimensions considered most related to job performance (see Barrick & Mount, 1991; Tett et al., 1991). These results suggested that applicants who score highly on the most desirable personality scales also tend to endorse impression management items, which could be troublesome for organizations selecting on personality, as applicants' scores may not represent honest responding. However, when Li and Bagger partialed impression management from personality, the criterion-related validity of the personality assessments remained essentially unchanged. Other researchers using score-correction methodologies have found similar results (e.g., Barrick & Mount, 1996; Christiansen, Goffin, Johnston, & Rothstein, 1994; Griffith, Malm, English, Yoshita, & Gujar, 2006), and these findings have been used as evidence that faking is not a legitimate concern when it is operationalized using scores on impression management scales (e.g., Ones et al., 1996). Nevertheless, the BIDR and related scales remain in use today, as some researchers believe they assess a construct of considerable importance in employment contexts (e.g., Smith & Ellingson, 2002; Viswesvaran, Ones, & Hough, 2001).
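The score-correction approach in these studies amounts to partialing impression management out of the personality-criterion relationship. A minimal sketch of the first-order partial correlation formula, with values invented for this example rather than taken from any meta-analysis:

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    # First-order partial correlation of x and y, controlling for z.
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Invented values: personality-criterion r = .25; impression management
# correlates .40 with personality but only .05 with the criterion.
print(round(partial_r(0.25, 0.40, 0.05), 2))  # ~0.25: essentially unchanged
```

When the faking indicator is nearly unrelated to the criterion, as in this illustration, partialing it out leaves the validity coefficient almost untouched, which is the pattern the score-correction studies report.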
Is faking a problem in the real world? Organizational researchers have used each of the methodologies described above not only to determine whether and to what extent applicants fake, but also to determine whether faking actually matters. Studies like Griffith et al.'s (2007) examine the effects of faking on the criterion-related validity of personality tests, such as whether faking affects the rank order of applicants. Other studies investigate the effects of faking on the construct validity of personality tests.

For example, using the direction manipulation methodology, Ellingson, Sackett, and Hough (1999) compared scores from a respond-honestly group and a fake-good group on a multidimensional personality test used for selection by the U.S. Army. A confirmatory factor analysis (CFA) for the respond-honestly condition confirmed the intended factor structure of the test. However, the CFA for the faking condition did not support the hypothesized factor structure of the instrument. A follow-up analysis revealed that the data from the faking group actually supported a unidimensional model of personality. Therefore, Ellingson et al. (1999) concluded that personality test faking can affect construct validity by reducing (or eliminating) the factor structure of the test. Because changes in the factor structure would change the interpretation of the test results, the use of this measure for making employment decisions would be highly questionable.

In another directed faking study, Douglas, McDaniel, and Snell (1996) compared the performance appraisal ratings of a sample of honest respondents to a sample of respondents instructed to fake good and found significant differences in the criterion-related validity for each group (.31 and -.01, respectively). These researchers concluded that, when a large number of people fake, the predictive validity of personality tests drops substantially.

Using a between-subjects design, Schmit and Ryan (1993) compared the factor structure fit of a Big Five personality test using data from applicants and non-applicants. Their CFA confirmed the five-factor structure for the non-applicant data, but not for the applicant data. A follow-up analysis revealed that the applicant group produced a sixth factor consisting of items intended for each of the five personality scales, agreement with which indicated an extremely positive self-bias (e.g., hard-worker, likable, committed). Schmit and Ryan labeled this factor the "ideal-employee factor" (p. 970) and concluded that faking has the potential to disrupt the construct validity of personality tests by introducing unintended scales to these measures.

Using the faking scale methodology, Rosse, Stecher, Miller, and Levin (1998) administered a Big Five measure and the BIDR to samples of applicants and incumbents. Applicants scored significantly higher than incumbents on agreeableness, conscientiousness, emotional stability, and extraversion, as well as on the impression management scale of the BIDR, which suggested that some applicants intentionally distorted their responses. Similar to Griffith et al. (2007), Rosse et al. then rank ordered the applicants by their personality scores and found that, under a top-down selection strategy, applicants with the highest impression management scores were overrepresented at the top of the distribution. Indeed, if the top 5% of applicants had been hired, seven of eight would have had impression management scores considered extreme (i.e., 3+ standard deviations above the mean). This ratio dropped to 9 of 16 when the top 10% were hired, though this proportion still equated to over 50% of selected applicants having impression management scores indicative of severe intentional distortion.

Most recently, Peterson, Griffith, O'Connell, and Isaacson (2008) utilized a within-subjects design to investigate whether applicant faking predicted individuals' engagement in counterproductive work behaviors (CWBs) once they were hired. A preliminary analysis found a non-significant correlation between individuals' conscientiousness scores as applicants and their CWB scores as incumbents. However, after removing data from the participants who exhibited statistically significant conscientiousness score changes between test administrations, Peterson et al. obtained a statistically significant improvement in the criterion-related validity of their measure, which suggested that faking affected the criterion-related validity of the conscientiousness scale. Interestingly, applicants' scores on the social desirability scale were unrelated to their conscientiousness scale score changes, suggesting that social desirability scales may not be useful indicators of actual applicant faking. However, because these researchers used a unidimensional social desirability scale (i.e., the Marlowe-Crowne short form; Crowne & Marlowe, 1960), they were unable to discuss these findings in terms of impression management alone (i.e., in the absence of self-deception enhancement). Nevertheless, applicants' social desirability scores correlated significantly with their CWB scores as incumbents, thereby supporting the argument that some social desirability response patterns may legitimately predict meaningful work outcomes (e.g., Smith & Ellingson, 2002; Viswesvaran et al., 2001).

Summary of the faking problem. Previous research on personality test faking has assumed a number of forms. Some studies investigated whether tests can be faked, while others asked whether they actually are faked. To address these concerns, researchers have sampled from non-applicants, true applicants, or some combination of the two.
The results of these studies vary by sample, context, and test, but the general conclusions are that personality tests are susceptible to faking and that faking occurs in actual selection settings (e.g., Birkeland et al., 2006; Griffith et al., 2007; Rosse et al., 1998; Viswesvaran & Ones, 1999). Although researchers cannot determine the true prevalence of applicant faking (estimates range from 15% to 63% of applicants; e.g., Dunnette, McCartney, Carlson, & Kirchner, 1962; Donovan, Dwight, & Hurtz, 2003; Griffith et al., 2007), the knowledge that even some applicants fake presents the more critical question: Is faking a problem?

Research on the consequences of applicant faking has produced mixed results. Some studies suggest that faking may disrupt the construct and criterion-related validity of personality tests (e.g., Schmit & Ryan, 1993; Ellingson et al., 1999; Douglas et al., 1996; Rosse et al., 1998). However, methodological weaknesses and confusion regarding the interpretation of many faking studies may limit their applied value (Ones et al., 2007). Additional research utilizing applied samples and comparatively more advanced analytic techniques than the majority of existing faking research may offer new perspectives on the issue of applicant faking and its associated effects on personality measures.

Measurement Equivalence and the DFIT Framework

Measurement equivalence. A principal assumption behind psychological measurement is that a well-developed scale (or item) may be used to make inferences about some unobservable characteristic(s) of the test taker. These unobservable characteristics, or latent constructs, include abilities (e.g., intelligence), attitudes, and traits (e.g., personality characteristics). With valid psychological measures, individuals' raw scores provide observable indicators of their standing on the latent construct measured by the scale or item (Lord & Novick, 1968). It follows that, with valid psychological measures, any two people with identical standing on a latent construct should produce the same expected scale-level or item-level scores (Drasgow & Kanfer, 1985). This testing property represents the concept of measurement equivalence.

Measurement equivalence occurs when "the relations between observed scores and latent constructs are identical across relevant groups" (Drasgow & Kanfer, 1985, p. 662). Measurement equivalence is necessary in order to make score comparisons across different groups of test takers. In the context of employment selection, a test must demonstrate measurement equivalence if the organization intends to use scores on the test to make selection decisions. For tests that lack measurement equivalence (i.e., that exhibit differential functioning), the comparison of scores across members of different groups becomes difficult, if not impossible. Researchers can assess measurement equivalence (or differential functioning) at the item level as well as the scale level. Among the statistical techniques available for assessing measurement equivalence, the differential functioning of items and tests (DFIT) framework proposed by Raju et al. (1995) has proven particularly useful for organizational researchers.

The DFIT procedure. The DFIT framework is based in item response theory (IRT), which assumes a nonlinear relationship between individuals' latent trait/ability levels, termed theta levels, and their observed scores on a test item or scale (Lord & Novick, 1968). Different IRT models exist for different types of items.
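One common instance of such a nonlinear relationship, shown here purely for illustration, is the two-parameter logistic model for a dichotomous item; its parameters are the discrimination and difficulty values described in the next paragraph. A minimal sketch:

```python
import numpy as np

def irf_2pl(theta, a, b):
    # Probability of a keyed response given theta, for an item with
    # discrimination a (slope) and difficulty b (location).
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 13)
print(np.round(irf_2pl(theta, a=1.5, b=0.0), 3))  # S-shaped curve centered at b
```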
For dichotomously scored items (i.e., items with two response alternatives; e.g., correct/incorrect, true/false), IRT models estimate the probability that an individual will respond to the item successfully based on the individual's theta level and the characteristics of the test item, such as the item difficulty and item discrimination parameters (Hambleton & Swaminathan, 1985). This relationship is represented graphically as the item characteristic curve, or item response function (IRF; see Figure 1). Most personality tests, however, do not feature dichotomous scoring formats, opting instead for multiple response categories. Such tests require polytomous IRT models. Because polytomous items feature multiple response categories, there are no true correct or incorrect responses to these items. Rather than estimate the probability of answering the item successfully, as found in dichotomous IRT models, polytomous IRT models estimate the probability that an individual will select each response category given the individual's theta level (Oshima, Kushubar, Scott, & Raju, 2009). Thus, for every response category, polytomous IRT models estimate a separate IRF, which are termed category response functions (CRFs; see Figure 2). As depicted in Figure 2, at any given theta level, the sum of the probabilities of the CRFs should equal one. For a more detailed description of dichotomous and polytomous IRT models, see Oshima et al. (2009).
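To make the notion of category response functions concrete, the following Python sketch computes CRFs for a single five-category item under a graded response model. The discrimination and threshold values are invented for illustration and are not parameters from any measure discussed here; the final line verifies that the category probabilities sum to one at every theta level, the property depicted in Figure 2.

    import numpy as np

    def grm_category_probs(theta, a, b):
        # Category response functions for Samejima's graded response model.
        # theta: latent trait levels; a: item discrimination; b: ordered
        # boundary (threshold) parameters, one fewer than the categories.
        theta = np.atleast_1d(theta)
        boundary = 1.0 / (1.0 + np.exp(-a * (theta[None, :] - np.asarray(b)[:, None])))
        ones = np.ones((1, theta.size))
        zeros = np.zeros((1, theta.size))
        cumulative = np.vstack([ones, boundary, zeros])   # P(X >= k) curves
        return cumulative[:-1] - cumulative[1:]           # one CRF per category

    a, b = 1.4, [-2.0, -0.8, 0.3, 1.5]      # hypothetical five-category item
    thetas = np.linspace(-3, 3, 7)
    crfs = grm_category_probs(thetas, a, b)
    print(np.round(crfs, 3))
    print(crfs.sum(axis=0))                 # equals 1.0 at every theta level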
Organizational researchers have been relatively slow to incorporate IRT models into their research, partly because IRT models require complex statistical equations (Ellis & Mead, 2002). However, computer programs are becoming increasingly available for estimating and equating IRT item parameters, thereby simplifying the analysis of measurement equivalence for work-related tests. The DFIT computer program based on Raju et al.'s (1995) framework, for instance, was recently released in its eighth edition (Oshima et al., 2009). Raju et al.'s DFIT procedure is one of many IRT-based methods for assessing measurement equivalence, its major distinction being that it assesses the differential functioning of both items (DIF) and tests (DTF) and applies to dichotomous as well as polytomous models. These qualities make the DFIT framework appropriate for examining the measurement equivalence of most personality measures used by organizations for personnel decision making.

The DFIT procedure provides an estimate of measurement equivalence (or, conversely, differential functioning) by comparing the item parameters of two subgroups of respondents: the focal group and the reference group. Users of the DFIT methodology define group membership, and can create groups based on any relevant characteristics of the examinees (e.g., race, gender, intelligence). After placing the item parameters on the same scale as the focal group, the DFIT program compares the IRFs or CRFs for each group, depending on whether the item is dichotomous or polytomous. Measurement equivalence exists when the response functions are identical. If the response functions differ, then there is evidence of differential functioning or measurement inequivalence (Oshima et al., 2009).

The DFIT program produces three indices of interest. The first index is the noncompensatory DIF (NCDIF) index, which is an item-level estimate of differential functioning that focuses on each item individually. In calculating NCDIF, the DFIT program assumes all other items are free from DIF. Users of this program tend to rely on the NCDIF index when making decisions regarding item-level data. The second index is the compensatory DIF (CDIF) index, which is an item-level estimate of differential functioning that does not assume all other items are free from DIF. Unlike NCDIF, the CDIF index accounts for correlated differential functioning among test items; therefore, items that exhibit DIF in opposing directions (i.e., one item favors the focal group and another favors the reference group) can cancel each other out using the CDIF indices (Henry & Raju, 2006). The third index calculated by the DFIT program is the DTF index, which is a scale-level estimate of differential functioning. Users of the DFIT program who are concerned with the scale-level performance of a measure tend to use the DTF index in conjunction with the CDIF indices to delete items individually until the scale no longer exhibits significant DTF. This strategy results in the deletion of fewer items than if one were to remove items based on the NCDIF values; however, it also permits the resulting modified scale to contain items with significant CDIF (i.e., assuming these items exhibit DIF in opposing directions).
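The published index definitions involve the focal group's theta distribution and specific weighting terms (Raju et al., 1995); the Python sketch below is a simplified illustration of their core logic only, using invented item parameters rather than estimates from any real test. NCDIF is computed here as the mean squared gap between the focal and reference groups' expected item scores, CDIF as the mean cross-product of an item's gap with the test-level gap (so that opposing gaps can cancel), and DTF as the mean squared test-level gap, which equals the sum of the CDIF values.

    import numpy as np

    def expected_score(theta, a, b):
        # Expected 1-5 item score under a graded response model: the minimum
        # score plus the sum of the four boundary curves P(X >= k).
        boundary = 1.0 / (1.0 + np.exp(-a * (theta[None, :] - np.asarray(b)[:, None])))
        return 1.0 + boundary.sum(axis=0)

    # Invented parameters for a three-item scale in each group. In an actual
    # DFIT run these would be estimates linked to a common metric first.
    ref = [(1.2, [-1.5, -0.5, 0.5, 1.5]),
           (0.9, [-2.0, -1.0, 0.0, 1.0]),
           (1.5, [-1.0,  0.0, 0.8, 1.8])]
    foc = [(1.2, [-1.9, -0.9, 0.1, 1.1]),   # thresholds shifted: easier to endorse
           (0.9, [-2.0, -1.0, 0.0, 1.0]),   # identical parameters: no DIF
           (1.5, [-1.4, -0.4, 0.4, 1.4])]

    theta_focal = np.random.default_rng(1).normal(size=10_000)  # focal thetas
    d = np.array([expected_score(theta_focal, af, bf)
                  - expected_score(theta_focal, ar, br)
                  for (af, bf), (ar, br) in zip(foc, ref)])     # per-item gaps
    D = d.sum(axis=0)                                           # test-level gap

    ncdif = (d ** 2).mean(axis=1)   # each item judged in isolation
    cdif = (d * D).mean(axis=1)     # lets opposing DIF cancel across items
    dtf = (D ** 2).mean()           # scale level; equals cdif.sum()
    print(np.round(ncdif, 3), np.round(cdif, 3), round(dtf, 3))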
As a prototypic example of the application of the DFIT methodology to personality testing, Mitchelson, Wicher, LeBreton, and Craig (2009) recently evaluated the Abridged Big Five Circumplex of personality traits for differential functioning by gender and ethnicity. Of the personality measure's 45 scales, 17 displayed NCDIF for gender and 28 displayed NCDIF based on ethnicity (33 of 45 scales exhibited DIF altogether). Findings of DIF were not uniform in that some items favored women while others favored men, and some items favored Caucasians while others favored African Americans. Therefore, in many cases the CDIF indices cancelled each other out, resulting in no evidence of DTF for gender and evidence of only two scales exhibiting DTF by ethnicity. Depending on one's level of focus (i.e., item level or scale level), evidence of differential functioning in this study may or may not be viewed as problematic. As most organizations consider scale-level scores in making personnel decisions, evidence of DTF on only 2 of 45 scales may not pose a serious threat to test validity. However, if one's theoretical orientation considers each individual item as a test in and of itself, then the finding that 33 of 45 scales exhibited DIF may be an issue of sizeable concern. Either way, this study demonstrated the practical use of the NCDIF, CDIF, and DTF indices and provided encouragement for organizational researchers interested in using the DFIT methodology for assessing measurement equivalence on personality items and scales.

DFIT and the faking problem. Most organizational studies that employ the DFIT methodology compare members of protected groups (e.g., men versus women, Caucasians versus racial minorities) as a means of determining compliance with equal employment laws. Relatively few studies have applied the DFIT framework to examine measurement equivalence among other relevant groups. Nevertheless, the DFIT methodology appears in a limited selection of research investigating applicant faking by comparing groups of fakers versus non-fakers on a variety of personality tests (e.g., Flanagan & Raju, 1997; Henry & Raju, 2006; Robie, Zickar, & Schmit, 2001; Stark, Chernyshenko, Chan, Lee, & Drasgow, 2001; Zickar & Robie, 1999). Table 2 provides a summary of the methodologies and results of these studies.

As Table 2 indicates, DFIT researchers investigating the measurement equivalence of personality tests across groups of fakers and non-fakers have employed a variety of methodologies, often with conflicting results. For instance, these researchers have grouped fakers and non-fakers using (a) fake-good and respond-honestly instructions, (b) high and low/average impression management scores, (c) applicants and non-applicants, or (d) a combination of the latter two grouping methods. The personality measures examined thus far include the 16-PF, the military's ABLE test, the Personal Preference Inventory, and an empirically derived conscientiousness scale from the CPI. Overall, the results of these studies suggest that faking can disrupt the measurement equivalence of some personality measures, but that DIF and DTF are more likely to occur when researchers instruct examinees to fake good or when they compare applicants to non-applicants. DFIT studies comparing high to low/average impression management groups indicate that personality measures tend to function equivalently for members of these groups (Flanagan & Raju, 1997; Henry & Raju, 2006; Stark et al., 2001). A closer examination of the limitations of these studies, however, reveals a need for additional DFIT research on the faking problem before one can draw firm conclusions from the extant research employing this methodology.

Flanagan and Raju (1997) and Stark et al. (2001) suggested that faking is not a problem for the 16-PF when grouping respondents using scores on the 16-PF's impression management scale. However, when comparing groups of applicants and non-applicants, Stark et al. found evidence of differential functioning. One potential limitation of these studies that might explain these divergent results is that the 16-PF uses a three-category response scale. Thus, there is little room for individuals to engage in impression management without completely distorting their responses. It may therefore come as little surprise that analyses comparing groups based on the 16-PF's impression management scale did not produce evidence of DIF or DTF, as respondents are restricted in their ability to fake good on this scale. An impression management scale with more than three response options, such as a five-point Likert-type scale, may well produce different results, because it would permit individuals to inflate their responses without misrepresenting their beliefs entirely. This same issue is also a notable limitation of Henry and Raju's (2006) study, as the CPI uses a dichotomous, true/false response scale. Again, respondents are restricted in the degree to which they can fake on the CPI, which might explain why these researchers did not find faking to affect the measurement equivalence of their empirically derived conscientiousness scale. Henry and Raju's choice of personality measure, too, may limit the results of their study. Although conscientiousness is an important job-related personality variable, the CPI does not directly assess this construct. Researchers may obtain different results using a more established measure, one designed with the explicit purpose of assessing conscientiousness. Using the military's ABLE personality test, Zickar and Robie (1999) demonstrated that, when instructed to fake good, individuals are capable of disrupting the measurement equivalence of a personality test. These researchers found DTF on two of the three scales they examined when comparing groups instructed to fake good and respond honestly.
However, direction manipulation studies are historically limited in that respondents who are instructed to fake good under controlled conditions tend to fake more than applicants in a true selection context. Thus, although this study demonstrated that DTF can result from faking, it may not represent true applicant behavior. Researchers may obtain different results using samples of actual job applicants rather than individuals instructed to behave like applicants. Finally, Robie et al. (2001) concluded that the PPI functions equivalently for applicants and incumbents: They found little evidence of DIF or DTF in their study. However, their sample size was below the recommended sample size for performing IRT-based analyses, which may have biased the results. Robie et al.'s focal group consisted of 999 applicants and their reference group consisted of 796 incumbents. Ideally, IRT-based analyses require 200 respondents per response option (Drasgow, 1989). Because the PPI uses a five-point rating scale, each group would have needed 1,000 participants in order to meet these recommendations. Researchers may obtain different results using larger samples than those used by Robie and colleagues.

Research using the DFIT procedure to test for measurement equivalence on personality measures across groups of fakers and non-fakers has produced mixed results, with some studies finding considerable evidence of DIF and DTF (e.g., Stark et al., 2001; Zickar & Robie, 1999) and others not (e.g., Flanagan & Raju, 1997; Henry & Raju, 2006; Robie et al., 2001). One potential reason for these differences derives from different conceptualizations of faking. Some studies conceptualized faking using impression management scores, while others compared applicants to non-applicants or honest to fake-good groups. Another reason for the lack of converging evidence concerns measurement issues. For instance, some studies used personality scales with relatively few response options, thereby minimizing the degree to which respondents could fake good. Also, studies such as Robie et al.'s did not achieve the requisite sample size for the analyses they performed. As Table 2 indicates, relatively few faking-related DFIT studies have focused on the same model of personality, let alone the same measure. The studies described above followed some combination of Cattell's framework (Cattell et al., 1970), the five-factor model (Goldberg, 1981), Hogan's (1982) socioanalytic theory, folk concepts (Gough & Bradley, 1996), and models of managerial job performance (Davis, Skube, Hellervik, Gebelein, & Sheard, 1996). Although the Big Five is the most commonly accepted model, none of the existing DFIT research examining fakers and non-fakers has done so with a true Big Five measure.

The Current Research

This dissertation developed from a research-oriented as well as an applied need. From a research perspective, this dissertation sought to investigate the issue of applicant personality test faking further through application of the IRT-based DFIT methodology. This methodology is statistically more rigorous than the majority of faking research, as it analyzes differences in response functioning between fakers and non-fakers as opposed to differences in total scores. One reason why this methodology does not appear more often in faking studies is that IRT analyses require large sample sizes.
The current research adds value to the faking literature by providing data from over 20,000 real world job applicants who completed the measures in an actual selection context. From an applied perspective, this dissertation developed from the need to determine whether applicants fake on the Fitability 5a personality inventory, and, if so, to determine whether applicant faking is a problem for this measure. Thus, in Study 1 an impression management scale was developed for exclusive use with the Fitability 5a to assess applicant faking on this test. The Fitability 5a impression management scale is unique to this Big Five measure in that its items were written to appear similar to the Fitability 5a's items and feature the same response format. These properties make it less likely that applicants will identify the impression management items for what they truly are; instead, applicants should perceive them as assessing the traditional Big Five traits. In Study 2, the new impression management scale was implemented in an actual selection context in which applicants completed the Fitability 5a as part of the job candidate screening process. After dividing the applicant sample into high and low scorers on the impression management scale, measurement equivalence was examined across these groups on all 55 items and five scales of the Fitability 5a to determine if faking produced DIF or DTF on this measure. This latter effort also served to validate the impression management scale.

This research differs from similar DFIT research investigating the effects of applicant faking on the measurement equivalence of personality-based selection measures in a number of ways. First, the sample not only exceeded the recommended sample size for IRT-based research, but also consisted of real-life job applicants completing the measures in a true selection context. Second, this study conceptualized faking using high and low scorers on an impression management scale designed exclusively to detect applicant faking using a five-point Likert-type scale. Third, this study assessed the measurement equivalence of all five factors of a true Big Five personality test; that is, none of the scales were derived empirically or post hoc, as in the case of Henry and Raju (2006). For these reasons, this dissertation serves as a logical next step in investigating the faking problem using scores on a custom-built impression management scale and the IRT-based DFIT methodology, thereby addressing both research-oriented and applied needs.

Chapter 3. Study 1: Development of the Fitability 5a Impression Management Scale

Organizations have at their disposal a variety of measures for assessing the Big Five factors of personality: agreeableness, conscientiousness, extraversion, neuroticism, and openness to experience. Although they may be psychometrically sound, many of these measures consist of hundreds of items, which could make them impractical for use in some employment contexts. As a personality assessment alternative, the 50-item Fitability 5 was developed with the specific goal of assessing the Big Five factors in a quick yet accurate manner. Lucius (2003) demonstrated that this instrument's five scales converged onto the Big Five taxonomy, with scale score correlations ranging from .69 to .88 with established five-factor model personality measures (Barrick & Mount, 1994; John & Donahue, 1994).
The Fitability 5 scales also demonstrated sufficient evidence of internal consistency (coefficient alphas ranging from .74 to .85) and criterion-related validity, with employees' scores correlating significantly with such job-related variables as worker well-being, self-esteem, satisfaction, motivation, and productivity. However, a comparison of applicants' scores with those of individuals who completed the Fitability 5 for developmental purposes revealed considerable score differences on 14 of the measure's 50 items. Under suspicion that members of the applicant sample may have intentionally distorted their responses to create a more favorable impression, the Fitability 5 underwent revision to become more resistant to faking. The revised measure, the Fitability 5a, was considered "new and improved" and "specifically designed for job applicant populations" (Lucius, 2003, p. 40). Fitability Systems has hosted the Fitability 5a for online applicant screening purposes for more than 5 years, and currently administers the test to over 10,000 applicants a month, including applicants to Fortune 500 firms. Although the test developers designed this measure to be more resistant to faking than the original Fitability 5, observations since the revision have revealed that applicants respond differently than incumbents and volunteers on the Fitability 5a as well. Rather than revise the items a second time, Fitability Systems contacted members of the industrial and organizational psychology program at Auburn University, including the current author, to determine an alternative method for resolving their potential faking problem. The agreed-upon solution provided the rationale for Study 1 of this dissertation: the development of the Fitability 5a impression management scale.

Test administrators often rely on impression management scales to detect examinees' engagement in intentional response distortion (as opposed to self-deception enhancement; Paulhus, 1984). Impression management measures such as Paulhus' (1984, 1998) BIDR have been used extensively in academic and applied research. However, some practitioners may be hesitant to use the BIDR for personnel selection due to its impractical length (40 items), because it was designed for general purposes (i.e., not designed for employment contexts, per se), or because it utilizes a unique response scale (1 = Not true to 7 = Very true). Because existing off-the-shelf impression management scales did not fit the needs of Fitability Systems, the development of a new impression management scale, one that could detect applicants' use of impression management in as few items as possible, was deemed necessary. Further, because this scale would be included as the sixth scale of the Fitability 5a, its items would need to appear similar to items on the Fitability 5a in terms of length, reading level, and response format. As part of the validation process, scores on this scale would also need to exhibit a significant and positive relationship with scores on a measure designed to assess the same construct. Therefore, Paulhus' (1998) BIDR was also used in Study 1 for construct validation.

Method

Item development. To assure the Fitability 5a impression management scale would appropriately assess the construct of interest, relevant theory guided the scale development process (DeVellis, 2003). Thus, it was necessary to distinguish between the two forms of social desirability: impression management and self-deception enhancement (Paulhus, 1998).
Because impression management represents an individual's attempt to deliberately misrepresent his or her true self, items were constructed specifically to assess the intentional form of social desirability. Sixty impression management-type items were developed for the initial item pool. Although most items were original, existing impression management items (e.g., the International Personality Item Pool, 2001) and scales (e.g., the impression management scale of Paulhus' BIDR, 1998) served as guides during the item-writing process. Care was taken to avoid items that were lengthy, difficult to read, double-barreled, double-negative, or ambiguous. Of the original 60 items, half were reverse coded. Four experts with backgrounds in individual differences and personality testing reviewed the initial item pool. The following set of instructions was provided for completing this task:

As part of the scale development process, subject matter experts are required to review the initial pool of items for the scale. Please examine the following items for grammar, clarity, and conciseness. Also, please evaluate whether each item fits the format of the Fitability test items as well as determine if each item is relevant for assessing impression management response patterns. Please make note of any items that require modification or elimination.

Based on the experts' feedback, 25 items were modified and five items were eliminated altogether. Reasons for item modification and elimination included the following:

- Too wordy
- Improper grammar
- Too negative sounding
- Too personal sounding
- Stated as an absolute (e.g., "I always," "I never")
- Unclear wording
- Did not appear to address impression management

This process resulted in a final item pool consisting of 55 items for administration to the developmental sample.

Participants and procedure. Participants were 300 undergraduate students (78.3% female, 86.7% Caucasian) who took part in the study for extra credit in psychology courses at Auburn University. Participants were divided into two distinct samples (Samples A and B) that completed the measures under different instructional sets. Participants completed these measures by following a Web link to SurveyMonkey, an online survey program used to host the measures. All data were collected anonymously. Participants received extra credit if they accessed the SurveyMonkey program, even in the absence of completing the measures. The following sections describe the methodologies used for Sample A followed by Sample B.

Sample A participants and procedure. Sample A consisted of 54 participants (77.8% female, 89.1% Caucasian). These participants completed the measures based on the direction manipulation paradigm used by researchers at Auburn University (e.g., Winkelspecht et al., 2006; Teague & Thomas, 2008) in their investigations of personality test faking and applicant response patterns. The specific instructions used for Sample A appeared to participants as follows:

Respond Honestly instructions: The test you are about to take will be used as an aid for making a hiring decision for the position of SALESMAN/SALESWOMAN 1. Please carefully adhere to the following request: ANSWER ALL QUESTIONS HONESTLY

Most Favorable Impression instructions: The test you are about to take will be used as an aid for making a hiring decision for the position of SALESMAN/SALESWOMAN 1.
Please carefully adhere to the following request: ANSWER ALL QUESTIONS IN A WAY THAT YOU BELIEVE WILL PRESENT THE MOST FAVORABLE IMPRESSION OF YOU AS A SALESMAN/SALESWOMAN.

Data from the first instruction condition (Respond Honestly) represented the respondents' true scores. Data from the second instruction condition (Most Favorable Impression) represented the respondents' scores as job applicants maximizing their use of impression management. The results from Sample A assisted in identifying the items that were most susceptible to score changes based on the two instruction conditions.

Sample A measures. Participants completed three different measures: the Fitability 5a, the impression management item pool, and a modified version of the BIDR. Items for each of these measures appear in Appendix A. First, participants completed the Fitability 5a, a 55-item Big Five personality test containing scales ranging from 10 to 12 items. Previous internal consistency estimates for the measure's five scales ranged from .69 to .75 (Lucius, 2003). For the current study, the items on the Fitability 5a used a five-point Likert-type response scale ranging from 1 = Strongly disagree to 5 = Strongly agree. Second, participants completed the 55 items constructed for the Fitability 5a impression management scale (described above). Participants responded to these items using the same five-point Likert-type response scale as the Fitability 5a (1 = Strongly disagree to 5 = Strongly agree). Third, participants completed a modified version of Paulhus' (1998) BIDR, one of the most widely used, reliable, and valid social desirability scales in existence (Li & Bagger, 2006). The unmodified version of the BIDR consists of 40 items divided equally between two 20-item subscales: self-deception enhancement (e.g., I never regret my decisions) and impression management (e.g., I never swear). These subscales differ in that self-deception enhancement measures unintentional self-promotion whereas impression management measures purposeful response distortion. Respondents indicated their level of agreement with each item using a seven-point Likert-type scale (1 = Not true to 7 = Very true). The modification made for the current study consisted of eliminating two items for being inappropriate to an employment context. These items were I have sometimes doubted my ability as a lover and I never read sexy books or magazines.

Sample A analyses and results. Separate analyses were conducted for Samples A and B. Sample A data were analyzed first to identify which items were most susceptible to score changes based on the two instruction conditions. Because Sample A participants responded to the impression management items both honestly and while presenting the most favorable impression, analyses using these data would inform the scale development process by determining which items respondents were best able to fake good. Sample A analyses would also benefit the scale development process by identifying which items were least susceptible to score changes between the two instructional conditions. Items that produced little to no score change between the instruction conditions would be eligible for elimination from the initial item pool, as these items would be considered least susceptible to faking. An examination of Sample A's descriptive statistics for the Respond Honestly condition revealed no departures from normality.
However, the data from the Most Favorable Impression condition displayed several departures from normality, as the item scores tended to be biased toward the high end of the response scale (Strongly agree). These results suggested that respondents were endorsing the items intended for the new scale more often in the Most Favorable Impression condition than in the Respond Honestly condition. Next, paired-samples t-tests and effect size (Cohen's d) analyses were conducted to determine whether Sample A respondents were able to change their scores significantly and meaningfully based on the different instruction conditions. Forty-eight items exhibited medium to large effect size differences between the two instructional conditions. These items were considered optimal for inclusion in the final scale, as they produced the most meaningful score differences between the Respond Honestly and Most Favorable Impression conditions. The seven items that produced small effect size differences were removed from all additional analyses, with the rationale that these items would be unlikely to identify impression management behavior. Following the Sample A analyses, the impression management item pool consisted of 48 items.

Sample B participants and procedure. Sample B consisted of 247 undergraduate students (78.1% female, 83.8% Caucasian). These participants completed the measures using the instructions typically provided with the Fitability 5a. Specifically:

Instructions: Below are a series of statements that broadly describe an individual's personality. Indicate whether you agree or disagree with each statement as it applies to you by "clicking" on the appropriate response. There are no right or wrong answers, nor is there an "ideal" response for each question. Attempting to misrepresent your true personality may actually work against you. The best approach is to simply respond truthfully. Do not think too much about your answer - go with your first impression.

Sample B measures. Sample B participants completed the same measures as the Sample A participants: the Fitability 5a, the impression management item pool, and a modified version of the BIDR.

Sample B analyses and results. An examination of the descriptive statistics from Sample B indicated that all items were normally distributed. However, 10 items exhibited biased response patterns (e.g., respondents did not use the entire scale; responses favored either Strongly disagree or Strongly agree) and were therefore removed from the item pool. Following this elimination, 38 items remained. The remaining 38 items were analyzed using exploratory factor analysis (EFA). According to DeVellis (2003), EFA is the "best means of determining which group of items, if any, constitute a unidimensional set" (p. 94). Before conducting the EFA, the data from Sample B were evaluated for appropriateness. The ratio of observations to variables was moderate but acceptable (6.5:1). The factorability of the correlation matrix was confirmed using Bartlett's test of sphericity (χ2 = 2,657.04, df = 703, p < .001), which indicated that the correlations, taken collectively, were significant. Further, the overall Kaiser-Meyer-Olkin measure of sampling adequacy (MSA = .80) indicated that the items were sufficiently intercorrelated and that the pattern of variable correlations was suitable for EFA. An examination of the partial correlation matrix found that no variable had a partial correlation with any other variable greater than ±.5.
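These suitability checks can be reproduced with the open-source factor_analyzer package for Python, which was not the software used in this dissertation. The sketch below runs Bartlett's test, the KMO measure of sampling adequacy, and the forced four-factor maximum likelihood EFA with direct oblimin rotation described in the next paragraph, using simulated stand-in data of the same dimensions as Sample B (247 respondents by 38 items).

    import numpy as np
    import pandas as pd
    from factor_analyzer import FactorAnalyzer
    from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                                 calculate_kmo)

    # Simulated stand-in for Sample B's 247 x 38 matrix of 1-5 responses.
    rng = np.random.default_rng(0)
    items = pd.DataFrame(rng.integers(1, 6, size=(247, 38)),
                         columns=[f"im_{i + 1}" for i in range(38)])

    chi_square, p_value = calculate_bartlett_sphericity(items)  # df = 38*37/2 = 703
    kmo_per_item, kmo_overall = calculate_kmo(items)            # per-item and overall MSA
    print(f"Bartlett: chi2 = {chi_square:.2f}, p = {p_value:.4f}")
    print(f"overall KMO/MSA = {kmo_overall:.2f}")

    # Forced four-factor solution, maximum likelihood extraction, direct
    # oblimin (oblique) rotation, mirroring the analysis described below.
    efa = FactorAnalyzer(n_factors=4, rotation="oblimin", method="ml")
    efa.fit(items)
    print(pd.DataFrame(efa.loadings_, index=items.columns).round(2))

With real item responses in place of the random stand-in, the same few lines yield the kinds of statistics reported here.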
Taken together, these checks established that all 38 items met the requirements for EFA (Hair, Black, Babin, Anderson, & Tatham, 2006). The next stage in conducting the EFA was to determine the number of factors to extract. Although the items for the new scale were intended to assess a single construct (i.e., impression management), it was considered possible that different subsets of items might yield different conceptualizations of the latent construct. The rationale for this possibility came from Schmit and Ryan (1993), who suggested that faking-based scales may consist of items that are directed toward different target objects. Using principal components analysis and the scree plot method (see Figure 3), it was decided to retain four factors. Next, an EFA was conducted with a forced four-factor solution using direct oblimin rotation (maximum likelihood extraction), as both theory and the previous analyses suggested that the resulting solution would produce correlated factors. The resulting pattern matrix produced four factors, with 10 items on the first factor, 4 on the second, 6 on the third, and 11 on the fourth. The four-factor solution accounted for 30.4% of the total variance. This initial four-factor solution was examined for potential improvements based on alpha-if-item-deleted estimates and item similarity. Based on these criteria, two items were eliminated from Factor 1 and two items were eliminated from Factor 4. This revision yielded four separate factors for consideration as the new impression management scale. Conceptual interpretation of these four factors suggested that they differed based on the target of the respondents' use of impression management: social conventions, reputation, responsibility, and emotions, respectively (see Table 3 for factors and items). To assess the construct validity of the four potential scales, a correlational analysis was performed to determine whether scores on these scales correlated significantly with scores on the BIDR (Table 4). Based on these results, it was concluded that the first impression management scale, which targeted social conventions, was the most similar to the BIDR's impression management scale as well as the most dissimilar to the BIDR's self-deception enhancement scale. Therefore, the eight items from the social conventions scale were adopted for use as the Fitability 5a impression management scale.

Discussion

Applicants and non-applicants scored differently on the Fitability 5a personality test, and one potential explanation for these score differences is that applicants fake good by inflating their responses on this test to appear more favorable to the hiring organization. The intentional use of false but favorable response patterns is the defining characteristic of the construct known as impression management. Therefore, in Study 1, the principal goal was to develop a scale for detecting applicants' use of impression management on the Fitability 5a. By following the scale development procedure offered by DeVellis (2003), an eight-item impression management scale was constructed for detecting applicant faking on the Fitability 5a. The items for this scale included:

1. I never listen in on other people's private conversations.
2. I always tell the truth.
3. I have lied to get myself out of trouble. (r)
4. I rarely gossip.
5. I sometimes talk bad about my friends behind their back. (r)
6. I find it easy to resist temptations.
7. I sometimes break the rules to get ahead. (r)
8. I always know why I do the things I do.
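As an illustration of how such a scale is typically scored, the Python sketch below computes a mean 1-5 impression management score for one applicant, reverse-scoring the three items marked (r) with the conventional 6-minus-response rule. The mean-scoring convention is an assumption made here for illustration; it is consistent with the 1-5 metric of the group scores reported in Study 2, but the operational scoring key is not reproduced in this document.

    import numpy as np

    REVERSED = [2, 4, 6]   # zero-based positions of the three (r) items above

    def score_impression_management(responses):
        # responses: eight 1-5 Likert answers in the item order listed above.
        # Items marked (r) are reverse-scored as 6 - response, so a high score
        # always indicates a more favorable self-presentation.
        keyed = np.asarray(responses, dtype=float).copy()
        keyed[REVERSED] = 6 - keyed[REVERSED]
        return keyed.mean()

    print(score_impression_management([5, 5, 1, 5, 1, 5, 1, 5]))  # 5.0, maximal
    print(score_impression_management([3, 3, 3, 3, 3, 3, 3, 3]))  # 3.0, midpoint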
This scale is appropriate for use with the Fitability 5a because it employs the same response scale and sentence structure, thereby allowing its items to appear similar to the items on the Fitability 5a. Because it consists of only eight items, the new impression management scale fits with the Fitability 5a's goal of offering a brief assessment of job candidates. In addition, because scores on this scale correlated significantly with scores from an existing impression management scale, there is preliminary evidence of construct validity. However, because the new impression management scale was designed to assess applicants' response patterns, it required application in a real world selection context to demonstrate whether it truly operates as intended. Therefore, Study 2 provided an applied context for administering the impression management scale to a sample of real world job applicants. According to theory, applicants who engage in faking should score differently on the Fitability 5a's scales than applicants who do not fake. To assess the extent to which faking affected applicants' personality measurement, the IRT-based DFIT procedure was used to test the measurement equivalence of the Fitability 5a across high and low impression management groups.

Chapter 4. Study 2: Is Faking a Problem for the Fitability 5a?

Organizational researchers have long been concerned with the issue of applicant personality test faking. In their investigation of the faking problem, researchers have employed a variety of methodologies, including direction manipulations, comparisons of applicants to non-applicants, and the incorporation of impression management scales into the selection battery. Much of this research relied on traditional analytic techniques based on classical test theory, such as making mean score comparisons between groups of fakers and non-fakers. Methodologies based on IRT, however, offer a more sophisticated and considerably superior technique for comparing different groups of respondents. Some organizational researchers have already incorporated IRT-based methods, such as the DFIT framework (Raju et al., 1995), into the investigation of the faking problem. These studies produced mixed results, with some studies indicating faking is a problem and others not. Study 2 attempted to build on this research by examining the effects of applicant faking on the Fitability 5a using the DFIT procedure, a large applied sample of job applicants, and an impression management scale designed exclusively for selection contexts.

Study 1 entailed the development of the Fitability 5a impression management scale and provided preliminary evidence of construct validity by demonstrating that students' scores on this scale correlated significantly with their scores on the impression management scale of the BIDR. To determine if applicant faking is a problem for the Fitability 5a, however, actual job applicants who score high on the impression management scale would need to respond differently to the Fitability 5a compared to applicants who do not score high on the impression management scale. Raju et al.'s (1995) DFIT procedure provides an appropriate test for differential functioning across groups of fakers and non-fakers.
Therefore, the purpose of Study 2 was to test the Fitability 5a for differential item functioning (DIF) and differential test functioning (DTF) across groups of high and low scorers on the Fitability 5a impression management scale using a sample of actual job applicants. If members of each group respond differently to the items or scales of the Fitability 5a, there would be reason to believe that faking disrupts the construct validity of this Big Five personality measure. Such an outcome would render the use of the Fitability 5a for personnel selection questionable, as users would not be able to compare scores on this test between those who fake and those who do not.

Method

Participants. Participants were 21,017 applicants to a large automotive parts and service company. All participants applied to work in locations within the United States between March and May 2009. Participants completed the measures through an online applicant screening program hosted by Fitability Systems. As part of the application process, participants voluntarily provided demographic information. Of the 20,910 participants who reported their gender, 18,222 (87.1%) were men and 2,688 (12.9%) were women. Of the 16,630 participants who reported their race, 10,761 (64.7%) were Caucasian (non-Hispanic), 3,099 (18.6%) were African American, and 1,819 (10.9%) were Hispanic. The positions applied for included technician/specialist (8,353; 39.7%), management (7,051; 33.5%), and customer service (5,613; 26.7%).

Measures. Participants completed the Fitability 5a personality test and the Fitability 5a impression management scale (described in Study 1) as part of a larger selection battery containing additional selection tests not considered for the current study. The organization that provided access to the applicant participants based their selection decisions in part on applicants' scores on the Fitability 5a. The organization did not use applicants' scores on the impression management scale for making employment decisions; these data were collected for research purposes only.

Procedure and analyses. Descriptive statistics and correlations among the variables for the total sample and the high and low impression management subgroups were obtained to provide a general overview of the data prior to conducting analyses for measurement equivalence. Because the DFIT methodology is based on IRT, the first step in testing for measurement equivalence was to ensure the data met the assumptions of IRT. IRT assumes that measurement scales are unidimensional (Hambleton & Swaminathan, 1985). Therefore, a principal-axis factor analysis was conducted on each of the Fitability 5a scales as well as the impression management scale in order to test for the IRT assumption of unidimensionality of scales. To satisfy this assumption, the first factor of each scale would need to account for at least 20% of the variance (Reckase, 1979).

Next, in order to examine measurement equivalence on the personality measure between fakers and non-fakers, the total sample of respondents was divided into two subgroups based on their impression management scores. Previous researchers have used a variety of cutoff points for identifying high and low impression management groups. High score cutoffs have included the median (Stark et al., 2001), the top quartile (Henry & Raju, 2006), and the top 15th percentile of scores (Flanagan & Raju, 1997). Low score cutoffs have included score ranges (e.g., scores between the 15th and 85th percentiles; Flanagan & Raju) as well as scores below the 50th percentile (Henry & Raju). For the current study, the frequency distribution of impression management scores was examined to determine the most representative high and low scoring groups. Each subgroup of respondents consisted of 5,000 participants, because this value was the maximum sample size permitted by the statistical software used to analyze the data. The high scoring group (i.e., faking group; focal group) had an average impression management score of 4.91 (SD = 0.09) and the low scoring group (i.e., non-faking group; reference group) had an average impression management score of 3.58 (SD = 0.24). Demographic characteristics for these groups were consistent with the total sample of participants.
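The exact rule used to form the two groups is described only as taking the most representative high and low scorers; one plausible reading, sketched below in Python with invented stand-in scores, is to sort the 21,017 applicants by impression management score and take the 5,000 highest as the focal (faking) group and the 5,000 lowest as the reference (non-faking) group. The simulated score distribution is invented and will not reproduce the exact group means reported above.

    import numpy as np

    rng = np.random.default_rng(7)
    # Stand-in for the 21,017 applicants' mean 1-5 impression management scores.
    im_scores = np.clip(rng.normal(4.2, 0.5, size=21_017), 1, 5)

    order = np.argsort(im_scores)
    low_group = order[:5_000]     # reference group: 5,000 lowest scorers
    high_group = order[-5_000:]   # focal group: 5,000 highest scorers

    for name, group in (("high", high_group), ("low", low_group)):
        scores = im_scores[group]
        print(f"{name}: M = {scores.mean():.2f}, SD = {scores.std(ddof=1):.2f}")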
As the Fitability 5a uses a polytomous response scale with five ordinal response categories, Samejima's (1969) graded response model was used for performing the DFIT analyses. The computer program Multilog 7.03 (Thissen, Chen, & Bock, 2003) was used to estimate the parameters for each response option of each item, as well as the latent trait score (θ) for each respondent. The Equate 2.1 computer program (Baker, 1997) was used to transform the parameter estimates for the faking and non-faking groups as needed to place them onto a common metric. Following this recalibration of the item parameters, DFIT analyses were conducted using the DFITps6 computer program (Raju, 2000). This program provides indices of compensatory and non-compensatory differential item functioning (CDIF and NCDIF, respectively), as well as an estimate of differential test functioning (DTF). The NCDIF values were examined for each personality test item separately to ascertain the presence of DIF, using a critical value of .096 for determining statistical significance at the .01 alpha level (as recommended by Raju et al., 1995). The critical values for determining statistically significant DTF differed by the number of items in each scale, and were calculated by multiplying the number of items in each scale by .096. For the 10-item openness scale, the critical value for statistically significant DTF was set to .960. For the 11-item agreeableness, conscientiousness, and neuroticism scales, the critical value for statistically significant DTF was set to 1.056. For the 12-item extraversion scale, the critical value for statistically significant DTF was set to 1.152. In the presence of DTF, the DFITps6 program removed items with statistically significant CDIF one at a time until there was no longer DTF on the scale.
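This iterative purification step can be expressed compactly. The sketch below assumes the item-level CDIF estimates are already in hand (in practice they come from the DFIT run itself) and uses invented values, not the published ones; it also recomputes the .096-per-item critical value as items are dropped, which is an implementation assumption rather than a documented detail of DFITps6. Because DTF equals the sum of the CDIF values, the loop simply removes the largest remaining CDIF until the sum falls below the cutoff.

    def prune_until_no_dtf(cdif_by_item, per_item_cutoff=0.096):
        # Drop the largest-CDIF item one at a time until the scale's DTF
        # (the sum of the remaining CDIF values) falls below its critical
        # value (.096 x number of remaining items; an assumption, see above).
        remaining = dict(cdif_by_item)
        removed = []
        while remaining and sum(remaining.values()) > per_item_cutoff * len(remaining):
            worst = max(remaining, key=remaining.get)   # most positive CDIF
            removed.append(worst)
            del remaining[worst]
        return removed, remaining

    # Invented CDIF values for an 11-item scale with uniform positive DIF.
    cdif = {f"item_{i}": v for i, v in enumerate(
        [3.1, 2.4, 2.2, 1.9, 1.6, 1.3, 1.1, 0.9, 0.05, 0.04, 0.03], start=1)}
    removed, kept = prune_until_no_dtf(cdif)
    print("removed:", removed)
    print("kept:", sorted(kept))

With these invented values, the loop strips the scale down to three nearly DIF-free items, the same kind of outcome reported for several scales in the Results below.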
After testing for measurement equivalence on the Fitability 5a's scales across faking and non-faking groups, the same set of procedures was completed using the total sample of respondents to assess measurement equivalence on the impression management scale based on participants' gender and race. These analyses were performed to determine if the items on this scale functioned differently for male versus female applicants as well as for Caucasian applicants versus applicants who identified themselves as African American or Hispanic. Because there were fewer than 5,000 women participants in the total sample, all female participants were considered in the gender analyses. However, because over 18,000 participants identified themselves as men (and due to the sample size constraints of the DFITps6 program), two random samples of 5,000 male respondents were used when conducting the gender-based analyses to ensure the estimates were stable and not subsample dependent. There were also fewer than 5,000 African American and Hispanic participants in the total sample; therefore, all participants from these groups were considered. Because over 10,000 participants identified themselves as Caucasian (and due to the sample size constraints of the DFITps6 program), two random samples of 5,000 Caucasian respondents were used when conducting the race-based analyses to establish that the estimates were stable and not subsample dependent.

Results

Impression management item and scale means and standard deviations for the total sample and for subsamples grouped by gender and race appear in Table 5 for norming purposes. Scale means and standard deviations for the five Fitability 5a scales and the impression management scale using the total sample as well as the high and low impression management subsamples appear in Table 6. For each scale, applicants from the high impression management subgroup scored higher, on average, than applicants from the low impression management subgroup, with the exception of the neuroticism scale, which demonstrated the opposite trend. Estimates of the effect sizes (Cohen's d) of these differences indicated that they were practically meaningful (i.e., exhibited large effect sizes greater than .80), with the exception of the extraversion scale, which exhibited a small effect (d = .19). Correlations among the five Fitability 5a scales and the impression management scale appear in Tables 7, 8, and 9 for the total sample of applicants and the high and low impression management subgroups, respectively. In all cases, applicants' impression management scores were positively and significantly correlated with their scores on the agreeableness, conscientiousness, extraversion, and openness scales, and negatively and significantly correlated with their scores on the neuroticism scale. However, given the large sample sizes used in the current analyses, even small, non-meaningful relationships were likely to be found statistically significant. The correlations between applicants' impression management scores and their extraversion scores, for instance, were .09, .05, and .04 for the total sample, high impression management group, and low impression management group, respectively. Though weak, each of these values was statistically significant at (at least) the .05 level.

To examine the Fitability 5a for measurement equivalence using the DFIT methodology, tests were first performed to satisfy the IRT assumption of unidimensionality of scales. Results of principal-axis factor analysis indicated that the first factor of each of the five personality scales and the impression management scale exceeded Reckase's (1979) cutoff of at least 20% of the variance, which satisfied the unidimensionality assumption and thereby supported the use of the remaining IRT-based analyses.
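The Reckase (1979) check is straightforward to compute. The sketch below approximates it by taking the largest eigenvalue of the inter-item correlation matrix as a share of the number of items (a principal components approximation to the principal-axis first factor), using simulated single-factor data in place of the applicant responses.

    import numpy as np

    rng = np.random.default_rng(42)
    # Simulated stand-in for one 11-item scale: one latent trait plus noise.
    n, p = 5_000, 11
    theta = rng.normal(size=(n, 1))
    loadings = rng.uniform(0.4, 0.8, size=(1, p))
    responses = theta @ loadings + rng.normal(scale=0.8, size=(n, p))

    corr = np.corrcoef(responses, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]     # descending order
    share = eigenvalues[0] / p                       # first factor's share
    print(f"first factor: {share:.1%} of the variance; "
          f"passes 20% rule: {share >= 0.20}")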
Results of the DFIT analyses for assessing measurement equivalence between fakers and non-fakers on the Fitability 5a personality test appear in the following sections alphabetically by scale. Next, results of the analyses for assessing measurement equivalence on the Fitability 5a impression management scale appear in the following order: males versus females, Caucasians versus African Americans, and then Caucasians versus Hispanics.

DFIT on the Fitability 5a. The following sets of results are based on analyses that tested for measurement equivalence on each of the Fitability 5a's personality scales when comparing subgroups of respondents (n = 5,000 each) identified as fakers (high impression management; focal group) and non-fakers (low impression management; reference group). Items were equated by putting them on the scale of the focal group, which in this case was the faking group. Items were considered to have statistically significant DIF at an alpha level of .01 if the NCDIF values exceeded the .096 cutoff score recommended by Raju et al. (1995) for items with five response options. The cutoff scores for determining DTF are provided separately for each scale analysis below, as these cutoff values vary by the number of items in the scale. For scales with significant DTF, items with the most significant CDIF were removed one at a time until there was no longer DTF for the scale. (Note: One can remove items based on either the NCDIF or the CDIF values, as both approaches produce a final scale free of DTF. However, the latter method results in fewer items deleted and is therefore preferable if one wishes to retain as many items as possible on the final DTF-free scale.)

Agreeableness. Agreeableness item means by subgroup, as well as the NCDIF, CDIF, and DTF values, appear in Table 10. Applicants identified as fakers scored higher on average on all items from this scale. When controlling for ability level across subgroups of fakers and non-fakers, 8 out of 11 agreeableness items showed statistically significant NCDIF. The agreeableness scale also exhibited significant DTF, as the uncorrected DTF index (19.51) exceeded the critical value for an 11-item scale (1.056). Items with statistically significant CDIF were deleted from this scale one at a time until the scale no longer exhibited DTF. The final DTF-free agreeableness scale consisted of three items (6, 10, and 22).

Conscientiousness. Conscientiousness item means by subgroup, as well as the NCDIF, CDIF, and DTF values, appear in Table 11. Applicants identified as fakers scored higher on average on all items from this scale. When controlling for ability level across subgroups of fakers and non-fakers, 10 out of 11 conscientiousness items showed statistically significant NCDIF. The conscientiousness scale also exhibited significant DTF, as the uncorrected DTF index (34.17) exceeded the critical value for an 11-item scale (1.056). Items with statistically significant CDIF were deleted from this scale one at a time until the scale no longer exhibited DTF. The final DTF-free conscientiousness scale consisted of two items (4 and 50).

Extraversion. Extraversion item means by subgroup and NCDIF values appear in Table 12. Applicants identified as fakers scored higher on average on all items from this scale. When controlling for ability level across subgroups of fakers and non-fakers, none of the extraversion items showed statistically significant NCDIF. There was no evidence of statistically significant DTF for the extraversion scale, as the uncorrected DTF index (0.68) was well below the critical value for a 12-item scale (1.152). Therefore, no items were deleted from this scale, and no CDIF or DTF data were necessary for Table 12.
Neuroticism. Neuroticism item means by subgroup, as well as the NCDIF, CDIF, and DTF values, appear in Table 13. Applicants identified as fakers scored lower on average on all items from this scale. When controlling for ability level across subgroups of fakers and non-fakers, 8 out of 11 neuroticism items showed statistically significant NCDIF. The neuroticism scale also exhibited significant DTF, as the uncorrected DTF index (44.20) exceeded the critical value for an 11-item scale (1.056). Items with statistically significant CDIF were deleted from this scale one at a time until the scale no longer exhibited DTF. The final DTF-free neuroticism scale consisted of three items (3, 15, and 48).

Openness. Openness item means by subgroup, as well as the NCDIF, CDIF, and DTF values, appear in Table 14. Applicants identified as fakers scored higher on average on all items from this scale. When controlling for ability level across subgroups of fakers and non-fakers, 9 out of 10 openness items showed statistically significant NCDIF. The openness scale also exhibited significant DTF, as the uncorrected DTF index (20.20) exceeded the critical value for a 10-item scale (0.96). Items with statistically significant CDIF were deleted from this scale one at a time until the scale no longer exhibited DTF. The final DTF-free openness scale consisted of two items (20 and 44).

DFIT on the impression management scale. The remaining results are based on the analyses that tested for measurement equivalence on the impression management scale based on gender (males versus females) and race (Caucasians versus African Americans and Caucasians versus Hispanics). For the gender-based analyses, two random subsamples of 5,000 male participants were tested, as over 18,000 applicants identified themselves as male and the computer programs used to analyze the data limited each comparison group to 5,000 subjects. Similarly, for the race-based analyses, two random subsamples of 5,000 Caucasian participants were tested, as over 10,000 applicants identified themselves as Caucasian. Analyses based on these subsamples were used to evaluate whether the estimates were stable and not subsample dependent. For the gender comparisons, items were equated by putting them on the scale of the male respondents. For the race comparisons, items were equated by putting them on the scale of the Caucasian respondents. Items were considered to have statistically significant DIF at an alpha level of .01 if the NCDIF values exceeded the .096 critical value (Raju et al., 1995). The cutoff score for determining statistically significant DTF was set to .768 (based on .096 x 8 items). Impression management item means and standard deviations for each group of respondents (i.e., two male subsamples, all females, two Caucasian subsamples, all African Americans, and all Hispanics) appear in Table 15. Based on the effect size estimates (Cohen's d), any differences in the mean scores based on gender or race were negligible.
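For reference, the sketch below shows the pooled-standard-deviation form of Cohen's d used for such two-group comparisons, with the conventional benchmarks of .20, .50, and .80 for small, medium, and large effects. The two simulated score vectors are invented stand-ins, sized like the reported gender groups; they are not the actual applicant data.

    import numpy as np

    def cohens_d(x, y):
        # Cohen's d for two independent groups: mean difference divided by
        # the pooled standard deviation.
        x, y = np.asarray(x, float), np.asarray(y, float)
        nx, ny = x.size, y.size
        pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                         / (nx + ny - 2))
        return (x.mean() - y.mean()) / pooled

    rng = np.random.default_rng(3)
    men = rng.normal(4.20, 0.55, size=18_222)    # invented IM scale scores
    women = rng.normal(4.22, 0.55, size=2_688)
    d = cohens_d(men, women)
    print(f"d = {d:.2f}; negligible by the .20 benchmark: {abs(d) < 0.20}")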
Table 15 also contains the NCDIF values for the gender- and race-based item-level comparisons when controlling for ability level in impression management. All NCDIF values were below the critical value of .096 and therefore non-significant. Thus, there was no evidence of DIF on the impression management scale based on gender or race. There was also no evidence of DTF on this scale based on gender or race, as all DTF values were below the critical value of .768 (therefore, the CDIF and DTF values do not appear in Table 15). Based on these results, the Fitability 5a impression management total scale and individual items function equivalently for male and female applicants as well as for Caucasian, African American, and Hispanic applicants.

Discussion

The DFIT procedure allows researchers to compare how individuals from different groups respond to items and scales. In the current study, the response functions on the Fitability 5a personality test were compared across subsamples of job applicants grouped according to their scores on the Fitability 5a impression management scale. This comparison was used to determine if fakers (i.e., high impression management scorers) responded differently to the Fitability 5a compared to non-fakers (i.e., low impression management scorers). Evidence of group differences would indicate that applicant faking has the potential to disrupt the measurement equivalence of a Big Five personality test used for selection. Study 2 revealed that applicants from the high and low impression management groups responded differently to the Fitability 5a. Of the 55 items on the Fitability 5a, 35 demonstrated significant DIF. Only the extraversion scale did not contain items with significant DIF. In all cases, DIF uniformly favored the high impression management group. Of the Fitability 5a's five scales, four demonstrated significant DTF. Only the extraversion scale did not demonstrate significant DTF. Producing a measure free of DTF required the elimination of the majority of items and would have resulted in a 3-item agreeableness scale, a 2-item conscientiousness scale, a 3-item neuroticism scale, and a 2-item openness scale. The extraversion scale would retain all 12 of its items, as this scale was free of DIF and DTF. From an applied perspective, these results provide strong support that applicant faking is a problem for this particular Big Five selection test.

Chapter 5. General Discussion

This dissertation makes two significant contributions to the investigation of applicant personality test faking. The first contribution met an applied need by developing an impression management scale for use with the Fitability 5a, a Big Five personality test used for employment selection. The second contribution provided empirical evidence to address the question: Does faking matter in the real world? The sections that follow provide a discussion of these contributions, followed by the limitations and implications of this research, including recommendations for future studies on applicant personality test faking.

Contribution 1: The Fitability 5a Impression Management Scale

Organizational researchers have long been concerned with the issue of applicant faking on personality tests used for selection. In investigating the prevalence and severity of the faking problem, researchers have relied on a variety of methodologies, including the use of impression management scales for identifying applicants who engage in intentional response distortion. Although there are numerous impression management scales available for use in organizational research, many of these measures are too lengthy for applied purposes or are intended for general populations as opposed to job applicants. In addition, off-the-shelf scales tend to use their own unique response formats that might affect applicants' response patterns.
If an impression management scale's response format does not match the response format of the other tests in the assessment battery, it may cue respondents that the impression management items are assessing a different construct and may therefore evoke different response patterns. For these reasons, some personality tests contain custom impression management scales designed to detect applicant faking exclusively on their measure.

The first contribution of this dissertation met an applied need by developing a custom impression management scale for the Fitability 5a. The Fitability 5a is a Big Five personality test designed for screening job candidates quickly and accurately. The newly developed Fitability 5a impression management scale is custom in that its items adhere to the format of the Fitability 5a items in terms of reading level and response scale (i.e., 1 = Strongly disagree to 5 = Strongly agree). In addition, the impression management scale contains only eight items, so it is consistent with the Fitability 5a's goal of providing a timely assessment.

In developing the Fitability 5a impression management scale, items were constructed to resemble items on other impression management scales, such as Paulhus' (1998) BIDR, which ensured that the initial item pool addressed intentional response distortion rather than self-deception enhancement. To maximize the likelihood of detecting applicant faking, the item pool was administered to student participants under respond-honest and fake-good instructions. This direction manipulation identified the items that demonstrated the largest effect sizes attributable to faking. After eliminating items that were less likely to detect faking, a factor analysis was performed using data from a normal-instructions sample to arrive at the final impression management scale. Students' scores on this scale correlated positively and significantly with Paulhus' impression management scale, which suggested that the new measure assesses the construct of impression management. The development of the Fitability 5a impression management scale satisfied the applied needs of Fitability Systems by providing a short and reasonably construct-valid measure for assessing applicant faking.

A key assumption behind impression management measurement is that examinees are unaware that they are responding to impression management items; otherwise, test administrators and researchers could not use scores on these scales to identify faking-related response patterns. Because the items on the new impression management scale resemble the Fitability 5a's items in terms of length, readability, and response format, it is believed that they could be added seamlessly to the personality measure without cueing examinees that the impression management items are assessing a sixth construct intended for the detection of faking.

Measurement equivalence was examined on the impression management scale across gender- and race-based groups. These analyses indicated that males and females, as well as Caucasians, African Americans, and Hispanics, tended to respond similarly to the items on this scale. This finding provided further evidence of construct validity for the scale, as members of different groups should interpret a construct-valid measure in the same way.
Findings of measurement equivalence across gender and race also support the use of the Fitability 5a impression management scale with applicant populations, as items on this scale do not appear to favor members of groups protected by equal employment laws (e.g., women, racial minorities).

Organizational researchers disagree on the extent to which applicant faking should be considered a real world threat. The newly developed Fitability 5a impression management scale has the potential to add to the investigation of applicant personality test faking by providing a measure designed exclusively for use with job applicants as opposed to general populations. Although there are a variety of off-the-shelf impression management scales for detecting intentional response distortion, most of these measures are designed for general purposes, contain a large set of items, utilize unique response scales, or otherwise do not incorporate well into personnel selection test batteries. Applied research investigating the faking problem using off-the-shelf impression management scales may be limited in that applicants may respond to these scales differently than to scales designed exclusively for organizational contexts. Because the Fitability 5a impression management scale was essentially designed as the sixth scale of this Big Five personality measure, it has the potential to evoke applicant response patterns that inform the faking debate in ways that off-the-shelf impression management scales do not.

Contribution 2: Faking Matters in the Real World

The second major contribution of this dissertation sought to answer the question: Does faking matter in the real world? To this end, the new Fitability 5a impression management scale was implemented in a real world selection setting consisting of actual job applicants. This applied investigation served to validate the new scale further by determining (a) whether high and low scorers on the impression management scale scored differently on the Fitability 5a's five scales and (b) whether faking affected the measurement equivalence of the personality measure.

A comparison of the mean score differences between high and low impression management applicants on the Fitability 5a's five scales revealed considerable score differences, in terms of effect size, on all but the extraversion scale. Those who scored higher on the impression management scale scored higher on the agreeableness, conscientiousness, and openness scales and lower on the neuroticism scale. Based on these score comparisons, one may conclude that the applicants who engaged in impression management scored in the more desirable direction compared to applicants who did not. These results served to validate the new scale, as producing favorable responses is a hallmark characteristic of impression management behavior.

The finding that some scales exhibited mean score differences and others did not is not unusual in the personality test faking literature. Although some research suggests that all Big Five scales are equally susceptible to applicant faking (e.g., Viswesvaran & Ones, 1999), other studies suggest that this is not the case (e.g., Birkeland et al., 2006). The results of the current research support the latter findings in that the Fitability 5a's scales were differentially susceptible to applicant faking.
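The effect sizes behind these comparisons are standardized mean differences. As a minimal illustration (not the analysis code used in the study), the following computes Cohen's d from two groups' means and standard deviations using a pooled standard deviation, which is reasonable here because the high and low impression management subgroups were the same size (n = 5,000); small discrepancies from the tabled values reflect rounding of the reported means and SDs.

```python
from math import sqrt

def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d using a pooled SD; assumes equal group sizes."""
    pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd

# Conscientiousness means and SDs for the high and low impression
# management groups, as reported in Table 6:
print(round(cohens_d(4.44, 0.36, 3.95, 0.42), 2))  # ~1.25, close to the 1.27 in Table 6
```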
In their meta-analytic study, Viswesvaran and Ones (1999) suggested that, of all the Big Five factors, agreeableness may be the least susceptible to faking. However, in the current study, the only scale that did not exhibit a meaningful effect for faking was the extraversion scale. One potential explanation for why the high and low impression management applicants did not score differently on the extraversion scale is that the most favorable response to the items on this scale did not always fall in the same direction (i.e., toward extraversion or introversion). On the other four personality scales, members of the high impression management group consistently scored higher (or lower, for neuroticism) on all items compared to the low impression management group. This was not the case for extraversion. An examination of the item-level mean scores for this scale (Table 12) reveals that the high impression management group produced higher scores on seven extraversion items and lower scores on the other five. Essentially, applicants who engaged in impression management interpreted approximately half the items as more favorable when responding toward the high end of the response scale (i.e., toward extraversion) and approximately half the items as more favorable when responding toward the low end of the response scale (i.e., toward introversion). Thus, at the item level, applicants who faked tended to score differently on extraversion compared to applicants who did not fake. At the scale level, however, the net effect is that these item-level differences cancelled each other out, resulting in nearly equivalent scale-level scores for the high and low impression management groups.

The finding that fakers and non-fakers score approximately the same at the scale level, but differently at the item level, on the extraversion scale supports the use of item-level analyses in personality test faking research. Organizations tend to base selection decisions on scale-level scores, which may explain why the majority of personality test faking research is conducted using scale-level scores. However, this practice has the potential to mask true differences among applicants that occur at the item level. Each personality test item is, in and of itself, a test of a latent personality construct. Therefore, item-level analyses are important to the investigation of the faking problem. The results of the current study support the need for additional research examining item-level score differences between fakers and non-fakers. Of course, item-level analyses require large sample sizes that are often difficult for organizational researchers to achieve, particularly in applied settings. As the current research contained data from over 20,000 actual job applicants, it offers a relatively rare but informative perspective on how applicant faking manifests in applied contexts.

Beyond mean score comparisons, the present research also provided more sophisticated item-level and scale-level analyses utilizing the DFIT methodology. At the item level, significant differential item functioning (DIF) occurred on the majority of items for the agreeableness, conscientiousness, neuroticism, and openness scales. Only the extraversion scale exhibited measurement equivalence between the high and low impression management groups.
This particular finding supports the results of Flanagan and Raju's (1997) study, in which the extraversion scale of the 16-PF exhibited measurement equivalence between fakers and non-fakers. Flanagan and Raju's study was limited, however, in that it examined only this one scale of the 16-PF. In a follow-up study, Stark et al. (2001) tested for measurement equivalence on all 16 scales of the 16-PF. Unlike the current research, Stark et al. found little evidence of DIF on these scales when comparing applicants grouped by impression management scores.

One potential reason why DIF was found in the current study but not in Stark et al.'s (2001) study is the choice of personality measure. The Fitability 5a is a Big Five measure of personality that uses a five-category response scale. The 16-PF measures 16 different personality traits, only some of which map onto the five-factor model. In addition, the 16-PF uses a three-category response scale, which, as explained previously, may limit the degree to which applicants can fake good. That DIF occurred in the current Big Five study and not in previous research using the 16-PF emphasizes the need to evaluate applicant response patterns for all personality measures. The results of faking research investigating differential functioning by impression management groups do not generalize across different personality measures. Because each personality measure contains different items, it is likely that different measures will be differentially susceptible to DIF.

At the scale level, results were similar to the mean score analyses in that differential test functioning (DTF) occurred for all of the Fitability 5a's scales with the exception of the extraversion scale. This finding suggests that some personality constructs may be more susceptible to DTF caused by impression management than others, which is consistent with previous research investigating the measurement equivalence of personality scales across fakers and non-fakers. Zickar and Robie (1999), for instance, found significant DTF on two of three scales of the military's ABLE personality test. One explanation for these findings is that applicants may perceive certain scales to be more job-related, which may prime applicants likely to engage in impression management (Henry & Raju, 2006). Additional research may seek to investigate DIF and DTF by comparing individuals grouped according to whether they perceive each item or scale as being related to the position in question. There are also numerous individual difference variables that may explain further how and why different applicants, including fakers and non-fakers, respond differently to personality test items and scales. Teague and Thomas (2008), for instance, recently found that intelligence and mood state affect faking. Although the addition of moderating variables may complicate the examination of differential test functioning between fakers and non-fakers, it will likely produce more informative results than could be achieved in the current study.

In the presence of DTF, test administrators can eliminate items that exhibit significant DIF in order to achieve DTF-free scales. In the current study, the four scales that exhibited DTF contained DIF items that uniformly favored the high impression management group. This finding limited the utility of the CDIF indices.
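In the DFIT framework, the scale-level DTF index is the sum of the item-level CDIF indices, so the item-deletion procedure described here amounts to a simple loop. The sketch below is illustrative only: it assumes the CDIF values are held fixed after each deletion and that the critical value stays at .096 times the original number of items (quantities the DFIT programs re-estimate in practice), and the CDIF values shown are hypothetical.

```python
def prune_to_dtf_free(cdif_by_item, critical_value):
    """Drop the largest CDIF contributor until the summed DTF index
    falls below the critical value; return the retained item numbers."""
    items = dict(cdif_by_item)
    while items and sum(items.values()) > critical_value:
        worst = max(items, key=items.get)  # item contributing the most DTF
        del items[worst]
    return sorted(items)

# Hypothetical CDIF values for an 11-item scale (cutoff = .096 x 11 = 1.056):
cdif = {1: 0.90, 2: 0.60, 3: 0.30, 4: 0.20, 5: 0.15, 6: 0.10,
        7: 0.08, 8: 0.07, 9: 0.05, 10: 0.04, 11: 0.03}
print(prune_to_dtf_free(cdif, critical_value=1.056))  # items 1 and 2 removed
```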
Using the CDIF item deletion procedure to produce DTF-free scales resulted in a 3-item agreeableness scale, a 2-item conscientiousness scale, a 3-item neuroticism scale, and a 2-item openness scale. Thus, to achieve measurement equivalence on the Fitability 5a's scales between fakers and non-fakers, the vast majority of items on this test would need to be removed. As the Fitability 5a is already a brief personality assessment tool, the removal of multiple items would considerably reduce the validity of this test for employment contexts.

Overall, at both the item level and the scale level, impression management severely impacted the measurement equivalence of the Fitability 5a, as the majority of test items exhibited DIF and the majority of scales exhibited DTF well beyond repair. These results contribute to the personality test faking literature by providing a relatively rare DFIT-based examination of a Big Five personality test using high and low impression management groups from a large applied sample of job applicants. Although some researchers have concluded that faking is not a problem in real world settings (e.g., Hogan et al., 2003), the current study provides strong evidence to the contrary. Similar to Schmit and Ryan's (1993) factor analysis research, this dissertation suggests that impression management adversely affects the construct validity of personality measures to a severe degree.

Limitations and Future Directions

As with any study, there are limitations to the current research that deserve consideration. These limitations, in turn, give rise to directions for future research.

The first limitation concerns the sample used for developing the Fitability 5a impression management scale. The scale development sample consisted entirely of undergraduate student participants. It is possible that students respond differently than actual job applicants, even when instructed to respond as if they were in an employment context. Any differences between the developmental sample and the applied population for which the measure is intended have the potential to influence how respondents interact with the measure, and may therefore limit the degree to which the scale actually detects applicant faking. A more appropriate developmental sample would consist of actual job applicants rather than undergraduate student participants.

An additional limitation concerns the direction manipulation instructions provided to students. Specifically, students were instructed to respond in a manner that would present the most favorable impression as a salesman/saleswoman. Thus, the items that appear on the final impression management scale may be biased toward detecting faking for sales positions rather than other positions. Also, as noted by McFarland and Ryan (2006), not every student may want a sales job, which may affect their motivation to respond to items as if they were sales applicants. As an alternative, the instructions could have requested that students respond in a manner that would maximize their chances of obtaining their dream job (e.g., Mueller-Hanson, Heggestad, & Thornton, 2006). However, different students could interpret these instructions differently, which could further affect the validity of the measure when placed in an applied context.

The scale development phase of this research was also limited in that the factor analysis performed on the impression management items produced a four-factor solution, even though the impression management construct is theoretically unidimensional.
Researchers have yet to investigate whether there are global and facet-level conceptualizations of impression management. However, research in the area of job satisfaction suggests that some psychological constructs lend themselves to overall and target-specific forms (e.g., Highhouse & Becker, 1993; Scarpello & Campbell, 1983). Schmit and Ryan's (1993) study suggested that faking-related scales may consist of items that assess a variety of constructs, which lends support to a possible multidimensional conceptualization of impression management. Future researchers may seek to explore this possibility. In the current research, the four potential impression management scales addressed four different target areas: social conventions, reputation, responsibility, and emotions. However, the impression management toward social conventions scale produced scores most similar to the BIDR's impression management scale and most dissimilar to the BIDR's self-deception enhancement scale. Therefore, this scale was adopted over the other potential choices as the final Fitability 5a impression management scale.

One could argue that the Fitability 5a impression management scale has not undergone sufficient validation tests to warrant its use in applied research. In part, fewer tests were performed on this scale than is common practice because the agreement established with Fitability Systems to develop the scale required immediate action. Although adequate tests for validation were performed on the impression management scale in the current research, further tests are recommended before this scale undergoes widespread use.

A lack of control over the applicant sample provided additional limitations for discussion. Although the applicants participating in this study all applied to the same organization, they did not all apply for the same position. Applicant participants applied for work in three major job categories: technician/specialist, management, and customer service. However, the DFIT analyses were performed across all job types to maintain sufficient and equivalent sample sizes in each impression management group. It is possible that applicants engaging in impression management did so differently based on the job to which they were applying. An appropriate follow-up study would investigate this possibility by comparing impression management within job categories rather than across job categories.

Applicant participants also applied exclusively to locations within the U.S. Organizations, including the organization investigated in the current research, are becoming more global. Culture is a critical variable in organizational research (Rousseau & Fried, 2001) and has the potential to influence how individuals interpret and respond to test items (Mitchelson et al., 2009). Members of different cultures may be more or less inclined to engage in impression management, for instance, based on whether their cultures are individualistic or collectivistic, or depending on their religious ideology. The closest the current study came to considering culture was the set of analyses testing for measurement equivalence on the impression management scale based on racial affiliation. Although there were no racial differences in response functioning on the impression management scale, it is possible that members of these groups interpreted the Fitability 5a items differently, which may have influenced the results.
More advanced methodologies incorporating nested designs may offer an opportunity for future researchers to take a variety of demographic or group membership variables into account when assessing measurement equivalence for individuals in groups within groups.

In part, the limitations of previous DFIT research provided the rationale for the current study. For instance, no previous research had examined measurement equivalence on all five scales of a true Big Five measure of personality between high and low impression management groups. Other studies were limited in their use of two- or three-option response scales, which have the potential to restrict the degree to which respondents can fake good. In comparison to these earlier studies, most of which found little or no evidence of DIF or DTF, the current study investigated a Big Five measure that uses a five-point response scale and uncovered considerable evidence of DIF and DTF. Therefore, future research in this area may wish to replicate the current study using additional Big Five measures to determine whether the present results are more attributable to the five-factor model of personality or to the Fitability 5a test itself. In addition, it may be of value to investigate whether the smaller response scales used by tests such as the CPI and 16-PF are responsible for previous findings of measurement equivalence by testing the items on these measures with their traditional response format as well as with an expanded response scale consisting of additional options. An assumption made in the current study is that smaller response scales were partially responsible for findings of measurement equivalence between fakers and non-fakers in previous DFIT research. Thus, an appropriate test of this assumption might involve introducing a polytomous response scale to a traditionally dichotomous test to determine whether measurement equivalence is more a property of the test's items or of its response scale.

One final limitation that permeates all phases of this research concerns the conceptualization of the faking variable. Researchers have conceptualized applicant faking as a trait variable, a situational variable, or both (e.g., Stark et al., 2001). Trait-based faking studies view faking as an individual difference variable and tend to assess faking using self-report measures, such as impression management scales. The current study adopted this approach by defining and assessing faking as the tendency to present oneself favorably, but falsely. Situational studies, on the other hand, view faking as a behavior or response strategy that "may manifest itself differently in different situations" (Mueller-Hanson et al., 2006, p. 309). These studies tend to assess faking by comparing scores obtained under different response conditions, such as by comparing individuals' scores from applicant and non-applicant conditions.

Comparisons of results produced from these two conceptualizations of faking indicate that they produce similar, but not identical, results (e.g., McFarland & Ryan, 2006; Stark et al., 2001), which has been noted as one reason the faking debate remains unresolved (Ones et al., 2007). The current and leading theoretical models of applicant faking tend to favor the faking-as-behavior approach (e.g., McFarland & Ryan, 2006; Mueller-Hanson et al., 2006). However, tests of situational models of faking do not lend themselves easily to true applicant samples, as they require data collection over multiple periods.
Organizations may be reluctant to provide access to their current or potential employees for repeated testing, thereby presenting a considerable hurdle for researchers wishing to extend lab-based tests of situational models to applied settings. As trait-based studies of faking require only a single administration of a faking-based measure, they are more practical for applied research. The current research developed out of an applied need to determine whether faking was a problem for the Fitability 5a; therefore, the trait-based strategy was deemed most appropriate. Future researchers should seek to incorporate both trait-based and situational models of faking into their investigations of the faking problem, or at least recognize the limitations and benefits of each approach.

Implications

There are several implications of the current research. Presently, organizational researchers have mixed opinions as to whether applicant faking is a problem for personality testing. Much of the research in this area has examined the effects of faking on the criterion-related validity of personality tests, such as whether applicant faking results in a different rank ordering of applicants. An alternative consequence of applicant faking entails the degree to which the psychological meaning of personality test scores changes due to faking. For organizations to use personality tests for making employment decisions, their tests must demonstrate measurement equivalence. In the absence of measurement equivalence, test scores become impossible to interpret, as they do not carry the same psychological meaning for members of different groups.

The finding of DIF and DTF on the Fitability 5a between high and low impression management groups, then, is an issue of considerable practical importance. The Fitability 5a's items and scales for the latent constructs of agreeableness, conscientiousness, neuroticism, and openness functioned differently for applicants who engaged in impression management versus those who did not. Mean item- and scale-level scores were not only more desirable for the high impression management group, but the results of the DFIT analyses suggest that the items and scales measured different latent constructs for fakers versus non-fakers. In this sense, applicant faking destroyed the construct validity of the Fitability 5a, thereby rendering the use of this measure for making selection decisions impossible.

Questions abound as to the proper method for investigating the faking problem. Thus far, the different methodologies employed by faking researchers have produced mixed results, leading to increased confusion and debate. The current research investigated the faking problem using the IRT-based DFIT methodology and determined that applicants' engagement in impression management produced differential functioning on the items and scales of a Big Five selection test. These findings have considerable implications for the ongoing debate surrounding the faking problem and suggest that future research continue to utilize more advanced approaches to item and scale analyses, such as those offered by IRT, as well as applied samples of real world job applicants.

The degree to which the current findings generalize to other personality tests or applicant populations is unknown. The Fitability 5a assesses the most readily accepted model of personality, the five-factor model.
However, the items on the Fitability 5a are sure to differ from the items of other Big Five measures; therefore, it is entirely possible that applicant faking is not a problem for other Big Five tests. Nevertheless, because the majority of the Fitability 5a's items and scales exhibited differential functioning between high and low impression management groups, it seems possible, if not probable, that at least some items and scales of other self-report Big Five personality tests with polytomous response formats will exhibit similar trends. Additional research aimed toward replicating the current study with other Big Five measures is necessary. Converging evidence of differential functioning between fakers and non-fakers on Big Five measures would call into question the use of these tests for employment decision-making. In light of the current findings, it may be time for organizational researchers and practitioners to begin looking toward alternative methods of personality assessment that are less susceptible to applicant faking.

References

Ackerman, P. L., & Heggestad, E. D. (1997). Intelligence, personality, and interests: Evidence for overlapping traits. Psychological Bulletin, 121, 219-245.
Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1, Whole No. 211).
Baker, F. B. (1997). EQUATE 2.1: Computer program for equating two metrics in item response theory [Computer software]. Madison: University of Wisconsin, Laboratory of Experimental Design.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1-26.
Barrick, M. R., & Mount, M. K. (1994). Personal Characteristics Inventory technical manual. Iowa City: University of Iowa, Department of Management and Organizations.
Barrick, M. R., & Mount, M. K. (1996). Effects of impression management and self-deception on the predictive validity of personality constructs. Journal of Applied Psychology, 81, 261-272.
Barrick, M. R., & Mount, M. K. (2005). Yes, personality matters: Moving on to more important matters. Human Performance, 18, 359-372.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9, 9-30.
Birkeland, S. A., Manson, T. M., Kisamore, J. L., Brannick, M. T., & Smith, M. A. (2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of Selection and Assessment, 14, 317-335.
Borgatta, E. F. (1964). The structure of personality characteristics. Behavioral Science, 12, 8-17.
Cattell, R. B. (1957). Personality and motivation structure and measurement. Yonkers-on-Hudson, NY: World Book.
Cattell, R. B., Eber, H. W., & Tatsuoka, M. M. (1970). Handbook for the Sixteen Personality Factor Questionnaire (16PF). Champaign, IL: Institute for Personality and Ability Testing.
Cervone, D., & Pervin, L. A. (2007). Personality: Theory and research (10th ed.). New York: John Wiley & Sons.
Christiansen, N. D., Goffin, R. D., Johnston, N. G., & Rothstein, M. G. (1994). Correcting the 16PF for faking: Effects on criterion-related validity and individual hiring decisions. Personnel Psychology, 47, 847-860.
Costa, P. T., Jr., & McCrae, R. R. (1989). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349-354.
Davis, B. L., Skube, C. J., Hellervik, L. W., Gebelein, S. H., & Sheard, J. L. (1996). Successful manager's handbook. Minneapolis, MN: Personnel Decisions International.
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: Sage.
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417-440.
Donovan, J. J., Dwight, S. D., & Hurtz, G. M. (2003). An assessment of the prevalence, severity, and verifiability of entry-level applicant faking using the randomized response technique. Human Performance, 16, 81-106.
Douglas, E. F., McDaniel, M. A., & Snell, A. F. (1996). The validity of non-cognitive measures decays when applicants fake. Academy of Management Proceedings, 127-131.
Drasgow, F. (1989). An evaluation of marginal maximum likelihood estimation for the two-parameter logistic model. Applied Psychological Measurement, 13, 77-90.
Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662-680.
Dunnette, M. D., McCartney, J., Carlson, H. C., & Kirchner, W. K. (1962). A study of faking behavior on a forced-choice self-description checklist. Personnel Psychology, 15, 13-24.
Ellingson, J. E., Sackett, P. R., & Connelly, B. S. (2007). Personality assessment across selection and development contexts: Insights into response distortion. Journal of Applied Psychology, 92, 386-395.
Ellingson, J. E., Sackett, P. R., & Hough, L. M. (1999). Social desirability corrections in personality measurement: Issues of applicant comparison and construct validity. Journal of Applied Psychology, 84, 155-166.
Ellis, B. B., & Mead, A. D. (2002). Item analysis: Theory and practice using classical and modern test theory. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 324-343). Malden, MA: Blackwell.
Fiske, D. W. (1949). Consistency of the factorial structures of personality ratings from different sources. Journal of Abnormal and Social Psychology, 44, 329-344.
Flanagan, W. J., & Raju, N. S. (1997, April). Measurement equivalence between high and average impression management groups: An IRT analysis of personality dimensions. Paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
Foldes, H. J., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality: Meta-analyses comparing five U.S. racial groups. Personnel Psychology, 61, 579-616.
Galton, F. (1884). Measurement of character. Fortnightly Review, 36, 179-185.
Gatewood, R. D., & Feild, H. S. (2001). Human resource selection (5th ed.). Mason, OH: South-Western.
Goldberg, L. R. (1981). Language and individual differences: The search for universals in personality lexicons. In L. Wheeler (Ed.), Review of personality and social psychology (Vol. 1, pp. 141-165). Beverly Hills, CA: Sage.
Gough, H. G., & Bradley, P. (1996). The California Psychological Inventory manual (3rd ed.). Palo Alto, CA: Consulting Psychologists Press.
Griffith, R., Chmielowski, T., & Yoshita, Y. (2007). Do applicants fake? An examination of the frequency of applicant faking behavior. Personnel Review, 36, 341-355.
Griffith, R. L., Malm, T., English, A., Yoshita, Y., & Gujar, A. (2006). Applicant faking behavior: Teasing apart the influence of situational variance, cognitive biases, and individual differences. In R. L. Griffith & M. H. Peterson (Eds.), A closer examination of applicant faking behavior (pp. 151-178). Greenwich, CT: Information Age.
Griffith, R. L., & McDaniel, M. (2006). The nature of deception and applicant faking behavior. In R. L. Griffith & M. H. Peterson (Eds.), A closer examination of applicant faking behavior (pp. 1-20). Greenwich, CT: Information Age.
Guion, R. M., & Gottier, R. F. (1965). Validity of personality measures in personnel selection. Personnel Psychology, 18, 135-164.
Hair, J. F., Jr., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.
Henry, M. S., & Raju, N. S. (2006). The effects of traited and situational impression management on a personality test: An empirical analysis. Psychology Science, 48, 247-267.
Highhouse, S., & Becker, A. S. (1993). Facet measures and global job satisfaction. Journal of Business and Psychology, 8, 117-127.
Hogan, R. (1982). A socioanalytic theory of personality. In M. M. Page (Ed.), 1982 Nebraska Symposium on Motivation (pp. 55-89). Lincoln: University of Nebraska Press.
Hogan, J., & Holland, B. (2003). Using theory to evaluate personality and job-performance relations: A socioanalytic perspective. Journal of Applied Psychology, 88, 100-112.
Hogan, J., Barrett, P., & Hogan, R. (2007). Personality measurement, faking, and employment selection. Journal of Applied Psychology, 92, 1270-1285.
Hogan, R. (2005). In defense of personality measurement. Human Performance, 18, 331-341.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152-194.
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85, 869-879.
John, O. P., & Donahue, E. M. (1994). The Big Five Inventory: Technical report of the 44-item version. Berkeley: University of California, Institute of Personality and Social Research.
Li, A., & Bagger, J. (2006). Using the BIDR to distinguish the effects of impression management and self-deception on the criterion validity of personality measures: A meta-analysis. International Journal of Selection and Assessment, 14, 131-141.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Oxford, England: Addison-Wesley.
Lucius, R. H. (2003, April). Technical report for the Fitability 5 personality profiles. Atlanta, GA: Fitability Systems, LLC.
McAdams, D. P. (1992). The five-factor model in personality: A critical appraisal. Journal of Personality, 60, 329-361.
McCrae, R. R., & Costa, P. T. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52, 81-90.
McFarland, L. A., & Ryan, A. M. (2000). Variance in faking across noncognitive measures. Journal of Applied Psychology, 85, 812-821.
Mitchelson, J. K., Wicher, E. W., LeBreton, J. M., & Craig, S. B. (2009). Gender and ethnicity differences on the Abridged Big Five Circumplex (AB5C) of personality traits. Educational and Psychological Measurement, 69, 613-635.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007). Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60, 683-729.
Mount, M. K., & Barrick, M. R. (1995). The Big Five personality dimensions: Implications for research and practice in human resource management. Research in Personnel and Human Resources Management, 13, 153-200.
Mueller-Hanson, R., Heggestad, E. D., & Thornton, G. C. (2006). Individual differences in impression management: An exploration of the psychological processes underlying faking. Psychology Science, 48, 288-312.
Norman, W. T. (1963). Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. Journal of Abnormal and Social Psychology, 66, 574-583.
Ones, D. S., Dilchert, S., Viswesvaran, C., & Judge, T. A. (2007). In support of personality assessment in organizational settings. Personnel Psychology, 60, 995-1027.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660-679.
Oshima, T. C., Kushubar, S., Scott, J. C., & Raju, N. S. (2009). DFIT8 for Windows user's manual: Differential functioning of items and tests. St. Paul, MN: Assessment Systems Corporation.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598-609.
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). New York: Academic Press.
Paulhus, D. L. (1998). Manual for the Balanced Inventory of Desirable Responding (BIDR-7). Toronto, Ontario, Canada: Multi-Health Systems.
Peterson, M. H., Griffith, R. L., O'Connell, M. S., & Isaacson, J. A. (2008, April). Examining faking in real job applicants: A within-subjects investigation of score changes across applicant and research settings. In R. L. Griffith & M. H. Peterson (Chairs), Examining faking using within-subjects designs and applicant data. Symposium conducted at the 23rd annual conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.
Raju, N. (2000). Notes accompanying the differential functioning of items and tests (DFIT) [Computer software].
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.
Robie, C., Zickar, M. J., & Schmit, M. J. (2001). Measurement equivalence between applicant and incumbent groups: An IRT analysis of personality scales. Human Performance, 14, 187-207.
Rosse, J. G., Stecher, M. D., Miller, J. L., & Levin, R. A. (1998). The impact of response distortion on preemployment personality testing and hiring decisions. Journal of Applied Psychology, 83, 634-644.
Rothstein, M., & Goffin, R. D. (2006). The use of personality measures in personnel selection: What does current research support? Human Resource Management Review, 16, 155-180.
Rousseau, D. M., & Fried, Y. (2001). Location, location, location: Contextualizing organizational research. Journal of Organizational Behavior, 22, 1-13.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 100-114.
Scarpello, V., & Campbell, J. P. (1983). Job satisfaction: Are all the parts there? Personnel Psychology, 36, 577-600.
Schmidt, F., & Hunter, J. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmit, M. J., & Ryan, A. M. (1993). The Big Five in personnel selection: Factor structure in applicant and non-applicant populations. Journal of Applied Psychology, 78, 966-974.
Smith, G. M. (1967). Usefulness of peer ratings of personality in educational research. Educational and Psychological Measurement, 27, 967-984.
Smith, D. B., & Ellingson, J. E. (2002). Substance versus style: A new look at social desirability in motivating contexts. Journal of Applied Psychology, 87, 211-219.
Stark, S., Chernyshenko, O. S., Chan, K. Y., Lee, W. C., & Drasgow, F. (2001). Effects of the testing situation on item responding. In B. Schneider, D. B. Smith, A. P. Brief, & J. P. Walsh (Eds.), Personality and organizations. Mahwah, NJ: Lawrence Erlbaum Associates.
Teague, S., & Thomas, A. (2008, April). Intelligence and mood state influence faking behavior on personality tests. Poster session presented at the annual meeting of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Tett, R. P., & Christiansen, N. D. (2007). Personality tests at the crossroads: A response to Morgeson, Campion, Dipboye, Hollenbeck, Murphy, & Schmitt (2007). Personnel Psychology, 60, 967-993.
Tett, R. P., Jackson, D., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703-742.
Thissen, D., Chen, W.-H., & Bock, R. D. (2003). Multilog (Version 7) [Computer software]. Lincolnwood, IL: Scientific Software International.
Tupes, E. C., & Christal, R. E. (1961). Recurrent personality factors based on trait ratings (USAF ASD Tech. Rep. No. 61-97).
Viswesvaran, C., Deller, J., & Ones, D. S. (2007). Personality measures in personnel selection: Some new contributions. International Journal of Selection and Assessment, 15, 354-358.
Viswesvaran, C., & Ones, D. (1999). Meta-analyses of fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59, 197-210.
Viswesvaran, C., Ones, D. S., & Hough, L. M. (2001). Do impression management scales in personality inventories predict managerial job performance ratings? International Journal of Selection and Assessment, 9, 277-289.
Winkelspecht, C., Lewis, P., & Thomas, A. (2006). Potential effects of faking on the NEO-PI-R: Willingness and ability to fake changes who gets hired in simulated selection decisions. Journal of Business and Psychology, 21, 243-259.
Zickar, M. J., & Robie, C. (1999). Modeling faking good on personality items: An item-level analysis. Journal of Applied Psychology, 84, 551-563.

Appendix A
Figures

Figure 1. Item Response Function (Item Characteristic Curve). Parameters a, b, and c represent the item discrimination (slope), item difficulty, and lower asymptote, respectively. [Figure not reproduced.]
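The a, b, and c parameters described in the Figure 1 caption are presumably those of the familiar three-parameter logistic (3PL) item response function; written out for reference (an illustrative form, since the figure itself is not reproduced here):

```latex
% 3PL item response function: a_i = discrimination (slope),
% b_i = difficulty (location), c_i = lower asymptote
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```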
Figure 2. Category Response Functions for a Five-Category Item. Response functions 1-5 represent the probability of selecting each of the five response choices. [Figure not reproduced.]

Figure 3. Scree Plot and Eigenvalues for Impression Management Scale Development. [Figure not reproduced.]

Appendix B
Tables

Table 1
Validity of Selection Tests Commonly Used for Predicting Overall Job Performance

Selection Test               Validity   Validity of Test Plus g   Gain in Validity
General mental ability (g)     .51               --                      --
Work sample                    .54               .63                     .12
Structured interview           .51               .63                     .12
Integrity                      .41               .65                     .14
Assessment centers             .37               .53                     .02
Biographical data              .35               .52                     .01
Conscientiousness              .31               .60                     .09

Adapted from Frank L. Schmidt and John E. Hunter, "The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings," Psychological Bulletin 124 (1998): 262-274.

Table 2
DFIT Research on Personality Test Faking and Measurement Equivalence

Study                   Fakers vs. Non-Fakers                               Personality Measure      Results
Flanagan & Raju (1997)  High vs. average IM                                 16-PF (extraversion)     ME
Zickar & Robie (1999)   Honest vs. fake instructions                        ABLE                     DIF & DTF
Robie et al. (2001)     Applicants vs. incumbents                           PPI                      ME
Stark et al. (2001)     (a) Applicants vs. incumbents; (b) high vs. low IM  16-PF                    DIF & DTF (a); ME (b)
Henry & Raju (2006)     (a) Applicants vs. incumbents; (b) high vs. low IM  CPI (conscientiousness)  ME

Note. IM = impression management; ME = measurement equivalence; DIF = differential item functioning; DTF = differential test functioning.

Table 3
Items, Factor Loadings, and Internal Consistency Estimates for the Four-Factor Solution

Factor 1: Social Conventions (α = .73)
  I never listen in on other people's private conversations.  .52
  I always tell the truth.  .48
  I have lied to get myself out of trouble. (r)  .44
  I rarely gossip.  .44
  I sometimes talk bad about my friends behind their back. (r)  .36
  I find it easy to resist temptations.  .35
  I sometimes break the rules to get ahead. (r)  .34
  I always know why I do the things I do.  .33

Factor 2: Reputation (α = .74)
  I never worry about what people think of me.  .72
  It does not upset me that some people do not like me.  .66
  It is easy to hurt my feelings. (r)  .48
  It really bothers me when people talk about me behind my back. (r)  .48

Factor 3: Responsibility (α = .72)
  People often tell me I work too hard.  .43
  I am always responsible.  .59
  I keep all of my paperwork filed.  .57
  I can get a lot more tasks accomplished compared to others.  .65
  Too much planning makes life boring. (r)  .46
  I am very disciplined.  .64

Factor 4: Emotions (α = .74)
  I often feel sorry for myself. (r)  .47
  I sometimes try to get even rather than forgive and forget. (r)  .36
  I tend to focus on the worst case scenario. (r)  .44
  When someone criticizes my work, it feels like a direct attack on me as a person. (r)  .46
  I have had emotional outbursts in public. (r)  .43
  I get angry more than I should. (r)  .57
  I usually get impatient if I have to wait. (r)  .34
  I sometimes think that people are laughing at me. (r)  .60
  I sometimes feel I am treated harshly without cause. (r)  .65

Table 4
Correlations among the Four Factors and the BIDR Scales

                                          IM      SDE     S1      S2      S3     S4
Impression Management (IM; BIDR)          1
Self-Deception Enhancement (SDE; BIDR)   -.42**   1
Scale 1: Social Conventions               .65**  -.53**   1
Scale 2: Reputation                       .03    -.42**   .39**   1
Scale 3: Responsibility                   .28**  -.21**   .22**  -.08     1
Scale 4: Emotions                         .43**  -.40**   .37**   .39**  -.01    1

** Significant at p < .01.
Table 5
Impression Management Item and Scale Means and Standard Deviations, Reported as Mean (SD)

Item/Scale                                                    Total Sample  Male         Female       Caucasian    African American  Hispanic
I never listen in on other people's private conversations.   4.49 (0.78)   4.48 (0.78)  4.54 (0.75)  4.47 (0.76)  4.57 (0.75)       4.46 (0.83)
I always tell the truth.                                      4.02 (1.37)   4.04 (1.16)  3.95 (1.24)  4.08 (1.08)  3.94 (1.31)       3.89 (1.33)
I have lied to get myself out of trouble. (r)                 4.17 (0.89)   4.18 (0.88)  4.14 (0.94)  4.10 (0.89)  4.28 (0.90)       4.30 (0.85)
I rarely gossip.                                              4.61 (0.65)   4.61 (0.65)  4.62 (0.64)  4.65 (0.60)  4.49 (0.76)       4.62 (0.62)
I sometimes talk bad about my friends behind their back. (r)  3.97 (1.11)   3.97 (1.10)  3.98 (1.14)  3.92 (1.06)  4.02 (1.21)       4.13 (1.10)
I find it easy to resist temptations.                         4.66 (0.68)   4.66 (0.68)  4.69 (0.66)  4.64 (0.68)  4.72 (0.65)       4.74 (0.63)
I sometimes break the rules to get ahead. (r)                 4.24 (1.06)   4.23 (1.07)  4.32 (1.01)  4.25 (1.04)  4.22 (1.11)       4.32 (1.01)
I always know why I do the things I do.                       4.07 (1.28)   4.08 (1.28)  3.99 (1.31)  4.10 (1.23)  4.03 (1.39)       3.98 (1.37)
Total Scale                                                   4.28 (0.53)   4.28 (0.53)  4.28 (0.53)  4.28 (0.52)  4.28 (0.56)       4.30 (0.52)

Table 6
Scale Means and Standard Deviations for the Total Sample and High/Low Impression Management Subgroups

                   Agreeableness  Conscientiousness  Extraversion  Neuroticism  Openness  Impression Management
Total sample
  Mean                 4.08            4.18              3.48          2.34        4.00           4.28
  SD                   0.46            0.43              0.51          0.50        0.48           0.53
High IM group
  Mean                 4.30            4.44              3.54          2.04        4.22           4.91
  SD                   0.46            0.36              0.54          0.46        0.48           0.10
Low IM group
  Mean                 3.88            3.95              3.43          2.62        3.80           3.59
  SD                   0.41            0.42              0.49          0.44        0.43           0.24
Effect size            0.98            1.27              0.19          1.31        0.94           7.13

Note. Total sample size = 21,017. Impression management (IM) subgroup sample sizes = 5,000. Effect sizes estimated with Cohen's d.

Table 7
Correlations among the Variables for the Total Sample

                            1       2       3       4       5      6
1. Agreeableness            1
2. Conscientiousness       .41**    1
3. Extraversion            .30**   .22**    1
4. Neuroticism            -.26**  -.28**  -.04**   1
5. Openness                .46**   .43**   .40**  -.25**   1
6. Impression Management   .36**   .44**   .09**  -.46**   .34**  1

** Significant at p < .01.

Table 8
Correlations among the Variables for the High Impression Management Group

                            1       2       3       4       5      6
1. Agreeableness            1
2. Conscientiousness       .31**    1
3. Extraversion            .30**   .19**    1
4. Neuroticism            -.15**  -.13**   .01     1
5. Openness                .30**   .29**   .30**  -.11**   1
6. Impression Management   .20**   .25**   .05**  -.26**   .29**  1

** Significant at p < .01.

Table 9
Correlations among the Variables for the Low Impression Management Group

                            1       2       3       4       5      6
1. Agreeableness            1
2. Conscientiousness       .27**    1
3. Extraversion            .31**   .16**    1
4. Neuroticism            -.05**  -.09**  -.04**   1
5. Openness                .39**   .26**   .36**  -.05**   1
6. Impression Management   .18**   .28**   .04*   -.24**   .13**  1

* Significant at p < .05. ** Significant at p < .01.

Table 10
Agreeableness Item Means and NCDIF, CDIF, and DTF Values for Fakers versus Non-Fakers

Agreeableness Item                                                             Faking  Non-Faking  NCDIF  CDIF      DTF
2. I am a charitable person                                                     3.68     3.54      0.247  2.18 (2)  11.57
6. I am fanatical about finishing all tasks, no matter how trivial              4.77     4.52      0.047
10. I am not particularly creative                                              3.03     2.74      0.047
14. I am very careful with decisions, even ones others might think are
    unimportant                                                                 4.60     4.17      0.166  1.79 (8)   0.47
22. I have a forgiving nature                                                   4.55     3.93      0.068
27. I like to have a plan and be organized before starting work                 4.40     4.10      0.211  2.01 (5)   3.89
32. I often find myself taking charge of a situation or project                 4.63     4.14      0.208  1.98 (6)   2.33
37. I sometimes talk too much                                                   4.28     3.77      0.274  2.31 (1)  15.16
42. My mood is stable regardless of the situation                               4.44     3.99      0.238  2.15 (4)   5.89
47. People often look to me to make important decisions                         4.46     3.89      0.200  1.96 (7)   1.17
54. When someone asks for a favor it is hard for me to say no …                 4.42     3.85      0.239  2.16 (3)   8.48

Note. Item numbers correspond to their order on the Fitability 5a. NCDIF = noncompensatory differential item functioning; NCDIF values exceeding the critical value of .096 are statistically significant at the .01 alpha level. CDIF = compensatory differential item functioning. DTF = differential test functioning. Numbers in parentheses represent the order in which CDIF items were removed to achieve non-significant DTF. The critical value for DTF was 1.056 for this scale.

Table 11
Conscientiousness Item Means and NCDIF, CDIF, and DTF Values for Fakers versus Non-Fakers

Conscientiousness Item                                                          Faking  Non-Faking  NCDIF  CDIF      DTF
4. I am careful in all of my decisions                                           4.83     4.42      0.188
13. I am too busy to be reflective                                               4.51     4.06      0.230  2.79 (8)   1.25
18. I enjoy serious conversations about life and philosophy                      4.75     4.21      0.351  3.41 (5)   7.49
25. I like to be the center of attention                                         4.85     4.38      0.301  3.07 (7)   2.69
30. I often analyze my thoughts and feelings                                     4.47     3.94      0.464  3.98 (1)  26.68
35. I often worry too much                                                       3.70     3.49      0.381  3.54 (3)  15.25
40. I'd rather stay flexible than to always have everything planned out          4.80     4.13      0.347  3.43 (4)  11.02
45. Others see me as very social                                                 4.59     3.97      0.198  2.57 (9)   0.53
50. Something has to be very important before I worry much about it              2.97     2.76      0.094
52. Though I'm sometimes harsh, people appreciate that I "tell it like it is"    4.64     3.98      0.344  3.37 (6)   4.68
55. When traveling I tend to make plans well in advance                          4.73     4.07      0.434  3.83 (2)  20.35

Note. Item numbers correspond to their order on the Fitability 5a. NCDIF = noncompensatory differential item functioning; NCDIF values exceeding the critical value of .096 are statistically significant at the .01 alpha level. CDIF = compensatory differential item functioning. DTF = differential test functioning. Numbers in parentheses represent the order in which CDIF items were removed to achieve non-significant DTF. The critical value for DTF was 1.056 for this scale.

Table 12
Extraversion Item Means and NCDIF Values for Fakers versus Non-Fakers

Extraversion Item                                                               Faking  Non-Faking  NCDIF
1. Good planning is more important than flexibility                              4.13     3.91      0.008
5. I am curious about many different things                                      4.77     4.20      0.004
9. I am not moody                                                                3.56     3.68      0.005
17. I don't like working with abstract concepts                                  4.70     4.20      0.003
21. I feel my best when I am around large groups of people                       2.22     2.49      0.007
26. I like to clean my desk each day before leaving work                         4.41     3.95      0.008
31. I often do favors for others                                                 4.65     4.03      0.003
36. I seek thrills and excitement                                                3.54     3.15      0.007
41. It is ok to stop working on a job if you are getting nowhere with it         3.73     3.59      0.008
46. People know right away if I'm in a good or bad mood                          1.93     2.44      0.002
51. Sometimes when I'm concerned or upset about something important …            2.69     2.79      0.002
53. When meeting someone new, I am usually the first to introduce myself         2.11     2.78      0.005

Note. Item numbers correspond to their order on the Fitability 5a. NCDIF = noncompensatory differential item functioning.

Table 13
Neuroticism Item Means and NCDIF, CDIF, and DTF Values for Fakers versus Non-Fakers

Neuroticism Item                                                                Faking  Non-Faking  NCDIF  CDIF      DTF
3. I am always willing to listen to my friends' problems                         1.68     2.30      0.078
7. I am generally trusting                                                       2.27     2.75      0.902  6.31 (2)  22.09
11. I am quick to forgive my friends                                             1.17     1.86      0.196  2.94 (8)   0.76
15. I can talk for long periods of time with friends, acquaintances …            3.11     2.87      0.081
19. I enjoy telling jokes and stories at parties                                 1.29     2.16      0.406  4.23 (6)   3.19
23. I have an active imagination                                                 1.97     2.65      0.415  4.27 (5)   5.84
28. I need to be around other people if I've been alone for several hours        1.31     2.23      0.844  6.01 (3)  14.48
33. I often get my own way                                                       1.98     2.66      1.007  6.64 (1)  31.91
38. I take some time each week to organize my workspace                          3.21     3.53      0.227  3.12 (7)   1.72
43. My mood often goes up and down                                               2.42     3.08      0.555  4.94 (4)   9.36
48. People say I worry about things that are not important                       1.98     2.90      0.094

Note. Item numbers correspond to their order on the Fitability 5a. NCDIF = noncompensatory differential item functioning; NCDIF values exceeding the critical value of .096 are statistically significant at the .01 alpha level. CDIF = compensatory differential item functioning. DTF = differential test functioning. Numbers in parentheses represent the order in which CDIF items were removed to achieve non-significant DTF. The critical value for DTF was 1.056 for this scale.

Table 14
Openness Item Means and NCDIF, CDIF, and DTF Values for Fakers versus Non-Fakers

Openness Item                                                                   Faking  Non-Faking  NCDIF  CDIF      DTF
8. I am interested in other people's culture and perspectives                    4.09     3.69      0.219  2.09 (5)   3.81
12. I am regarded as very, very nice, warm, pleasant and tender-hearted          4.74     4.21      0.179  1.88 (8)   0.40
16. I don't mind being criticized                                                4.18     3.82      0.212  2.07 (6)   2.22
20. I enjoy theoretical work                                                     3.62     3.32      0.086
24. I keep working on a task even when it appears that I'm not making much
    progress                                                                     4.14     3.73      0.322  2.55 (1)  15.43
29. I never get upset when other people ridicule and tease me                    4.44     3.82      0.236  2.18 (4)   5.84
34. I often think/rethink about how I should have said/done …                    4.47     4.14      0.214  2.05 (7)   1.08
39. I will criticize someone in public if they deserve it                        4.10     3.90      0.278  2.37 (2)  11.58
44. Others see me as kind and sympathetic                                        3.89     3.33      0.144
49. People see me as creative and inventive                                      4.58     4.00      0.255  2.26 (3)   8.41

Note. Item numbers correspond to their order on the Fitability 5a. NCDIF = noncompensatory differential item functioning; NCDIF values exceeding the critical value of .096 are statistically significant at the .01 alpha level. CDIF = compensatory differential item functioning. DTF = differential test functioning. Numbers in parentheses represent the order in which CDIF items were removed to achieve non-significant DTF. The critical value for DTF was 0.96 for this scale.

Table 15
Impression Management Item Means and Standard Deviations by Gender and Race

Subgroup                             Item 56   57     58     59     60     61     62     63
A1. Males 1            Mean            4.48   4.02   4.19   4.62   3.97   4.65   4.25   4.12
                       SD              0.79   1.17   0.87   0.62   1.10   0.69   1.05   1.25
A2. Males 2            Mean            4.48   4.04   4.19   4.62   3.97   4.66   4.23   4.11
                       SD              0.78   1.16   0.88   0.64   1.11   0.67   1.07   1.26
B. Females             Mean            4.54   3.95   4.14   4.62   3.98   4.69   4.33   3.99
                       SD              0.75   1.24   0.94   0.64   1.14   0.66   1.01   1.31
A1 vs. B               Effect size     0.08   0.06   0.06   0.00   0.01   0.06   0.08   0.16
                       NCDIF           0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
A2 vs. B               Effect size     0.08   0.07   0.05   0.00   0.01   0.05   0.10   0.17
                       NCDIF           0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
C1. Caucasians 1       Mean            4.47   4.09   4.09   4.64   3.90   4.63   4.25   4.09
                       SD              0.77   1.06   0.90   0.60   1.07   0.69   1.04   1.23
C2. Caucasians 2       Mean            4.47   4.07   4.10   4.65   3.92   4.63   4.26   4.09
                       SD              0.76   1.08   0.89   0.59   1.07   0.68   1.02   1.23
D. African Americans   Mean            4.57   3.94   4.28   4.49   4.02   4.72   4.22   4.03
                       SD              0.75   1.31   0.90   0.76   1.21   0.65   1.11   1.39
Appendix C

Measures

Modified 55 Item Pool for Inclusion in Scale Development

Respondents answer using the following 5-point scale:

1 = Strongly agree   2 = Moderately agree   3 = No opinion   4 = Moderately disagree   5 = Strongly disagree

1. I am always good to others.
2. I often feel sorry for myself.
3. I always try to practice what I preach.
4. I am often the "peace maker" when arguments occur among my friends.
5. I sometimes try to get even rather than forgive and forget.
6. I usually say exactly what is on my mind.
7. I have little trouble making new friends.
8. People often tell me I work too hard.
9. I am the first to admit when I make a mistake.
10. I act differently around different people.
11. I am always responsible.
12. I tend to focus on the worst case scenario.
13. I am sometimes rude.
14. I sometimes tell lies.
15. I have never tried to cover up a mistake.
16. When someone criticizes my work, it feels like a direct attack on me as a person.
17. There have been occasions when I have taken advantage of someone to get ahead.
18. I have had emotional outbursts in public.
19. I get angry more than I should.
20. I keep all of my promises.
21. Being told not to do something makes me want to do it even more.
22. I keep all of my paperwork filed.
23. I never worry about what people think of me.
24. It does not upset me that some people do not like me.
25. I have received too much change from a cashier without telling him or her.
26. I sometimes break the rules to get ahead.
27. I will do anything for others.
28. It is easy to hurt my feelings.
29. I can get a lot more tasks accomplished compared to others.
30. I sometimes take things that do not belong to me.
31. I find it easy to resist temptations.
32. Too much planning makes life boring.
33. I have used flattery to get ahead.
34. I always know why I do the things I do.
35. I am not concerned with making a good impression on people.
36. I usually get impatient if I have to wait.
37. I always tell the truth.
38. I never listen in on other people's private conversations.
39. I am a persistent and steady worker.
40. I sometimes talk bad about my friends behind their back.
41. I have lied to get myself out of trouble.
42. I have never revealed someone else's secret.
43. I am always willing to lend a hand.
44. It really bothers me when people talk about me behind my back.
45. I work hard on all jobs that I undertake.
46. It is important for me to do my best.
47. I am always nice to others.
48. If I have mistreated someone, I can hardly bear to face him or her again.
49. I get concerned when someone I am expecting does not show up on time.
50. I sometimes think that people are laughing at me.
51. I sometimes feel I am treated harshly without cause.
52. I rarely gossip.
53. I try to avoid using profanity.
54. I am very disciplined.
55. I have done things that I prefer to be kept secret.
Fitability 5a

Instructions: Below are a series of statements that broadly describe an individual's personality. Indicate whether you agree or disagree with each statement as it applies to you by selecting the appropriate response. There are no right or wrong answers, nor is there an "ideal" response for each question. Attempting to misrepresent your true personality may actually work against you. The best approach is to simply respond truthfully. Do not think too much about your answer; go with your first impression.

Items are rated on a five-point scale: 1 = Strongly agree to 5 = Strongly disagree

1 Good planning is more important than flexibility
2 I am a charitable person
3 I am always willing to listen to my friends' problems
4 I am careful in all of my decisions
5 I am curious about many different things
6 I am fanatical about finishing all tasks, no matter how trivial
7 I am generally trusting
8 I am interested in other people's culture and perspectives
9 I am not moody
10 I am not particularly creative
11 I am quick to forgive my friends
12 I am regarded as very, very nice, warm, pleasant and tender-hearted
13 I am too busy to be reflective
14 I am very careful with decisions, even ones others might think are unimportant
15 I can talk for long periods of time with friends, acquaintances, coworkers… just about anyone
16 I don't mind being criticized
17 I don't like working with abstract concepts
18 I enjoy serious conversations about life and philosophy
19 I enjoy telling jokes and stories at parties
20 I enjoy theoretical work
21 I feel my best when I am around large groups of people
22 I have a forgiving nature
23 I have an active imagination
24 I keep working on a task even when it appears that I'm not making much progress
25 I like to be the center of attention
26 I like to clean my desk each day before leaving work
27 I like to have a plan and be organized before starting work
28 I need to be around other people if I've been alone for several hours
29 I never get upset when other people ridicule and tease me
30 I often analyze my thoughts and feelings
31 I often do favors for others
32 I often find myself taking charge of a situation or project
33 I often get my own way
34 I often think and rethink about how I should have said or done something better
35 I often worry too much
36 I seek thrills and excitement
37 I sometimes talk too much
38 I take some time each week to organize my workspace
39 I will criticize someone in public if they deserve it
40 I'd rather stay flexible than to always have everything planned out
41 It is ok to stop working on a job if you are getting nowhere with it
42 My mood is stable regardless of the situation
43 My mood often goes up and down
44 Others see me as kind and sympathetic
45 Others see me as very social
46 People know right away if I'm in a good or bad mood
47 People often look to me to make important decisions
48 People say I worry about things that are not important
49 People see me as creative and inventive
50 Something has to be very important before I worry much about it
51 Sometimes when I'm concerned or upset about something important, others don't seem to understand or care
52 Though I'm sometimes harsh, people appreciate that I "tell it like it is"
53 When meeting someone new, I am usually the first to introduce myself
54 When someone asks for a favor it is hard for me to say no… even if it is inconvenient
55 When traveling I tend to make plans well in advance

Balanced Inventory of Desirable Responding

Using the scale below as a guide, please respond to each statement to indicate how much you agree with it.

1 - - - - - 2 - - - - - 3 - - - - - 4 - - - - - 5 - - - - - 6 - - - - - 7
NOT TRUE                    SOMEWHAT TRUE                    VERY TRUE

1 My first impressions of people usually turn out to be right.
2 It would be hard for me to break any of my bad habits.
3 I don't care to know what other people really think of me.
4 I have not always been honest with myself.
5 I always know why I like things.
6 When my emotions are aroused, it biases my thinking.
7 Once I've made up my mind, other people can seldom change my opinion.
8 I am not a safe driver when I exceed the speed limit.
9 I am fully in control of my own fate.
10 It's hard for me to shut off a disturbing thought.
11 I never regret my decisions.
12 I sometimes lose out on things because I can't make up my mind soon enough.
13 The reason I vote is because my vote can make a difference.
14 My parents were not always fair when they punished me.
15 I am a completely rational person.
16 I rarely appreciate criticism.
17 I am very confident of my judgments.
18 It's all right with me if some people happen to dislike me.
19 I don't always know the reasons why I do the things I do.
20 I sometimes tell lies if I have to.
21 I never cover up my mistakes.
22 There have been occasions when I have taken advantage of someone.
23 I never swear.
24 I sometimes try to get even rather than forgive and forget.
25 I always obey laws, even if I'm unlikely to get caught.
26 I have said something bad about a friend behind his or her back.
27 When I hear people talking privately, I avoid listening.
28 I have received too much change from a sales person without telling him or her.
29 I always declare everything at customs.
30 When I was young I sometimes stole things.
31 I have never dropped litter on the street.
32 I sometimes drive faster than the speed limit.
33 I have done things that I don't tell other people about.
34 I never take things that don't belong to me.
35 I have taken sick-leave from work or school even though I wasn't really sick.
36 I have never damaged a library book or store merchandise without reporting it.
37 I have some pretty awful habits.
38 I don't gossip about other people's business.