It is well established among those who carry out, analyze, and report pre-employment performance testing that slope-based bias in those tests is rare. Why is this important? Look at the following three graphs from a recent study by Aguinis, Culpepper and Pierce (2010):
Figure 1A shows the idealized scenario that is assumed to hold most of the time: the minority population has lower performance results than the majority population, but both have the same-shaped distribution and thus regression lines with the same slope (equally straight, equally steep). The minority line is simply lower overall, giving it a lower y-intercept (the point where the line crosses the vertical axis on the left).
Figures 1B and 1C show what is assumed NOT to happen. In these cases, the minority group has a lower (flatter) slope, and this actually brings its y-intercept closer to the majority’s value.
In real-life situations, one might want to adjust for the y-intercept (a measure of overall, or average, performance) so that the minority and majority groups have lines of the same height. The myriad reasons for doing this are not important right now; just assume that we believe it to be fair to make sure that the absolute best person with purple skin has the same chance of getting a job as the absolute best person with green skin (where one is the majority and one is the minority). If there is a systematic performance difference, it may well be because of something consistent between the groups that we don’t care about but want to adjust for.
Psychometrics experts have long contended that we can do this by simply adjusting the y-intercept (the vertical position) of the line without negative consequence. If, however, they are wrong, and the real-life situation looks more like B or C in the figures above, this would be bad. It would mean that people in the minority group who perform at the top of their game would still be under-measured, less likely to get the job, and the subject of bias against them.
The established wisdom appears to be wrong
Well, it turns out that the very study that provided these pretty graphs, “Revival of Test Bias Research in Preemployment Testing” by Herman Aguinis, Steven Culpepper and Charles Pierce (published in the Journal of Applied Psychology) strongly suggests that we’ve gotten this wrong all along. It is not safe to assume that there is no bias in slope in these tests, and in fact, there is reason to expect that there usually, or at least often, is a slope difference despite the fact that the opposite has been “well established.”
… these established conclusions are not consistent with recent … research showing that [testing for] slope-based differences is usually conducted with insufficient levels of statistical power, which can lead to the incorrect conclusion that bias does not exist… Also, these established conclusions are not consistent with expectations … that sociohistorical-cultural and social psychological mechanisms are likely to lead to slope-based differences across groups.
… and not just y-intercept differences. Meaning, an adjustment assuming no difference in slope would result in a bias against the group with the shallower slope when it comes to actually doling out jobs or promotions.
The study provides a new and very sophisticated look at the statistics underlying this sort of analysis and strongly suggests that “…intercept-based differences favoring minority group members is a result of a statistical artifact. In fact … we would expect to find artifactual intercept-based differences favoring the minority group even if these differences do not exist in the population.”
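The flavor of this artifact can be seen in a toy simulation (mine, not the paper’s; the paper’s argument is analytical, and all names and parameters below are my own assumptions). Both groups get identical true slopes and intercepts and differ only in mean ability; adding ordinary measurement error to the predictor is enough to manufacture an intercept gap in the fitted lines:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a, b = 10.0, 1.0  # identical TRUE intercept and slope for both groups

# True ability: the minority group's mean is one SD lower; nothing else differs.
t_maj = rng.normal(0.0, 1.0, n)
t_min = rng.normal(-1.0, 1.0, n)

# Criterion (job performance) depends only on true ability.
y_maj = a + b * t_maj + rng.normal(0.0, 1.0, n)
y_min = a + b * t_min + rng.normal(0.0, 1.0, n)

# Observed test score = true ability + measurement error (reliability = .50 here).
x_maj = t_maj + rng.normal(0.0, 1.0, n)
x_min = t_min + rng.normal(0.0, 1.0, n)

# Fit a separate regression line (slope, intercept) within each group.
s_maj, i_maj = np.polyfit(x_maj, y_maj, 1)
s_min, i_min = np.polyfit(x_min, y_min, 1)
print(f"majority: slope={s_maj:.2f} intercept={i_maj:.2f}")
print(f"minority: slope={s_min:.2f} intercept={i_min:.2f}")
```

With the seed above, both fitted slopes come out near 0.50, while the fitted intercepts land near 10.0 (majority) and 9.5 (minority): an intercept difference that exists nowhere in the generating model. And because the minority line sits below the pooled line, a single common regression would overpredict minority performance, which is the standard sense in which intercept differences are said to “favor” the minority group.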
An important conclusion of this study, and a rather startling one for most people in the field or who rely on psychometric research to justify their race-based (racist) agendas, is that intercept-based differences (actual overall mean differences) in performance on tests “… are smaller than they are believed to be or even nonexistent …” which in turn is consistent with the findings of a number of recent studies that have brought the whole methodology into question.
Can you say “paradigm shift?”
An inadequate history
So-called Industrial and Organizational Psychology (I/O research, or psychometrics) has a long history, and the last couple of decades of that history have involved two overarching trends: 1) the relative isolation of the field into highly specialized journals, with research teams very comfortably citing each other and shutting out external criticism, and 2) a growing strong belief (and I have carefully selected that word … belief) in the validity of the methods and the accumulated evidence based on those methods.
But it may well turn out that much of this internal self-love is based on a poorly assembled house of cards. For example, Aguinis, Culpepper and Pierce document a large number of prior studies that were done with inadequate sample sizes. They point out that Lent, Auerbach, and Levin (1971) found that the median sample size of 406 studies in Personnel Psychology between 1954 and 1969 was 68; studies in human resource selection published in Personnel Psychology between 1950 and 1979 had similarly low sample sizes, according to Monahan and Muchinsky (1983). Dozens of studies published in the Journal of Applied Psychology, the Journal of Occupational and Organizational Psychology, and Personnel Psychology were similarly flawed (Salgado, 1998; Russell et al., 1994).
So that is the breadth of the problem. The intensity of the problem is exemplified in a specific case outlined by Aguinis, Culpepper and Pierce. They had a look at a paper by Rotundo and Sackett (1999) in which the authors concluded that “the sample size used in the present study was double the largest tabled value in the Stone-Romero and Anderson article, and the predictor reliabilities were in the .80 to .90 range. . . . We suspect that the power to detect a small effect size in the present study would be reasonably high” (ibid., p. 821). It wasn’t. Aguinis, Culpepper and Pierce computed the statistical power of Rotundo and Sackett’s test (using a standard method) at 0.101. The usual benchmark for this statistic is 0.80 or higher.
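Power figures like 0.101 are easy to reproduce in spirit with a Monte Carlo sketch (my own illustration, with my own assumed effect and sample sizes; the authors used analytic formulas). The usual slope-bias test is the interaction term in a moderated regression, and simulating it shows how feeble that test is at realistic sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def power_slope_test(n_min, n_maj, slope_gap, n_sims=500):
    """Estimate the power of the slope-difference (interaction) test in
    moderated regression y = b0 + b1*x + b2*g + b3*(x*g), with H0: b3 = 0."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n_min + n_maj)
        g = np.r_[np.ones(n_min), np.zeros(n_maj)]  # 1 = minority group
        y = 1.0 + 0.5 * x + slope_gap * x * g + rng.normal(0.0, 1.0, n_min + n_maj)
        X = np.column_stack([np.ones_like(x), x, g, x * g])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (len(y) - 4)
        se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[3, 3])
        hits += abs(beta[3] / se) > 1.96  # normal approximation to the t test
    return hits / n_sims

p_small = power_slope_test(100, 100, slope_gap=0.1)
p_large = power_slope_test(2000, 2000, slope_gap=0.1)
print(p_small, p_large)
```

With a true slope gap of 0.1 and 100 people per group, power hovers near 0.1, so the test misses the slope bias roughly nine times out of ten; pushing power past the conventional 0.80 threshold takes thousands of people per group, which mirrors the authors’ point about Rotundo and Sackett.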
The point here is simple: psychometricians can arm-wave all they like about how many studies have been done, how those studies replicate each other and give similar results, and even how a few key well-endowed studies (those with large samples) support the broader range of lesser studies. But the arm-waving looks rather silly when one looks at the plethora of bad studies using questionable data interpreted optimistically, to the extent that one wonders if there is some sort of built-in denial. Or worse.
A bad methodology can have victims
And this is not a problem of small magnitude. Aguinis, Culpepper and Pierce describe what it would have taken for the much-touted Rotundo and Sackett study to actually achieve adequate power:
Statistical power would increase to what is usually considered the acceptable level of .80 if some of the design and measurement characteristics are improved. For example … increasing the sample size in the African American subgroup from 1,212 to 32,000, increasing the sample size in the Whites group from 17,020 to 90,000…
Clearly, we should not be impressed with numbers like “one thousand” when numbers of an order of magnitude more are needed to make the strong and important claims that are often made.
And the effects of what appears to be a systematic inadequacy of the entire field on the humans under study are astounding. Aguinis et al. re-examined the Rotundo and Sackett study to see who would be affected, and how, if the resulting model were used to hire individuals after taking a test. If the biases inherent in the analysis were ignored,
… there would be 20.6% of false negatives in the African American subgroup and 1.42% of false positives in the White subgroup. … about 250 African Americans (out of a total of 1,212) would be denied employment incorrectly and about 242 (out of a total of 17,020) White applicants would be offered employment incorrectly.
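The headline counts follow directly from the quoted rates and subgroup sizes; a two-line check (the rates and group sizes are from the paper, the arithmetic is mine):

```python
# Rates and subgroup sizes as quoted from Aguinis, Culpepper and Pierce.
false_negatives = 0.206 * 1212    # African American applicants wrongly rejected
false_positives = 0.0142 * 17020  # White applicants wrongly accepted
print(round(false_negatives), round(false_positives))  # prints: 250 242
```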
It’s the slope, stupid. Oh, and the sample size and the underlying assumptions and a few other things …
So, getting it right, and more poignantly, decades of insisting that it has been gotten right when it wasn’t, is highly consequential.
A large portion of the recognized problem has to do with sample size, relative sample size (between groups), and statistical power. These characteristics of a study are inherent to the testing method itself rather than to the groups being tested. An example that serves as a metaphor (but not as an exact statistical homologue) would be as follows. Suppose you are in a baseball league and the final playoffs are approaching. The teams from two cities have a competition to decide which city will host the final games. It is a home-run derby of sorts, where the team that can hit the ball the farthest in an open field wins. But there is a special rule: one batter is allowed in the competition for every 100,000 people living in each city. So if this were Saint Paul and Minneapolis (the latter is much larger than the former), Saint Paul would have only a few batters while Minneapolis would have many. As a result, the playoff games would usually be held in Minneapolis. The reason is simple: the chance of getting an outlier, a hit that is exceptionally long or short, is greater with more attempts. Similarly, the outer portions (lower or higher) of an x-y pairing of data will be more extreme (more lower and more higher) in larger samples, and these more extreme values affect the slope (and slope affects intercept). This illustrates that something as simple as sample size can matter to the outcome of an analysis, in a way that appears to say something meaningful about the underlying population, but where that “meaning” is an artifact, not a reality.
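The statistical core of the metaphor, that estimates wobble far more in small samples, is easy to demonstrate. In this sketch (an illustration under my own assumed parameters, not anything taken from the paper), the true slope is fixed in every trial; only the sample size changes the spread of the estimated slopes:

```python
import numpy as np

rng = np.random.default_rng(2)

def slope_spread(n, trials=500):
    """Standard deviation of fitted slopes across repeated samples of size n.
    The TRUE slope is 2.0 in every trial; only sampling noise varies."""
    slopes = np.empty(trials)
    for i in range(trials):
        x = rng.normal(0.0, 1.0, n)
        y = 2.0 * x + rng.normal(0.0, 1.0, n)
        slopes[i] = np.polyfit(x, y, 1)[0]  # fitted slope for this trial
    return slopes.std()

print(slope_spread(20), slope_spread(2000))  # small samples: roughly 10x the wobble
```

A minority subgroup an order of magnitude smaller than the majority subgroup therefore yields a far noisier slope estimate, which is exactly the condition under which an underpowered test “finds” no slope difference.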
So, as a result, a purely statistical artifact, a feature built into the system of measurement and analysis, appears to have been written off as nonexistent in psychometrics. But it exists.
In the case of the present study, statistical effects cause slopes estimated from smaller samples to flatten out compared to those estimated from larger samples. In any event, the authors suggest that future studies evaluate the statistical power of the tests being used and be more careful about drawing conclusions from inadequate sample sizes. The study also recommends more dramatic shifts in approach:
… we suggest a new approach to human resource selection that examines how members of various groups are affected by testing and also approach testing from a different perspective. [Involving] a change in direction in human resource selection research including an expanded view of the staffing process that considers in situ performance and the role of time and context. In situ performance is the “specification of the broad range of effects–situational, contextual, strategic, and environmental–that may affect individual, team, or organizational performance” [in order] to lead to a better understanding of why and conditions under which “tests often function differently in one ethnic group population than the other”
The tests were biased anyway, but it is worse than previously thought
There is a second set of factors linked to bias in test results, which is well summarized and discussed in the article at hand: The sociohistorical-cultural explanation. I refer you to page 5 of the paper for more details, but briefly, this involves performance differences caused by two factors in minority individuals who would otherwise perform the same as majority individuals: 1) Real differences in ethnically shaped views of what matters for success and 2) performance bias owing to added pressures of being the minority who is required to act as the majority.
For the present, suffice it to say that these effects can also result in biases that have not been properly controlled for, and more specifically, slope differences.
When it comes down to it, our concern is that psychometrics is making a consistent, widespread, and damaging pair of errors: a Type II error in concluding, from underpowered tests, that slope-based bias does not exist, and a Type I error in concluding that intercept-based bias does, and thus adjusting in a way that is inappropriate. The Aguinis, Culpepper and Pierce paper provides a new statistical demonstration that a widespread mistake in analysis “… can lead to the conclusion that a test is biased in favor of minority group members when in fact such a difference does not exist in the population of scores.”
Aguinis, H., Culpepper, S., & Pierce, C. (2010). Revival of test bias research in preemployment testing. Journal of Applied Psychology, 95(4), 648-680. doi:10.1037/a0018714
Lent, R. H., Auerbach, H. A., & Levin, L. S. (1971). Research design and validity assessment. Personnel Psychology, 24, 247-274.
Monahan, C. I., & Muchinsky, P. M. (1983). Three decades of personnel selection research: A state-of-the-art analysis and evaluation. Journal of Occupational Psychology, 56, 215-225.
Salgado, J. F. (1998). Sample size in validity studies of personnel selection. Journal of Occupational and Organizational Psychology, 71, 161-164.