It is a guiding principle in test development that stimulus materials and test questions should not upset test-takers. Much like dinner conversation with in-laws, tests should refrain from referencing religion, or sex, or race, or politics — anything that might provoke a heightened emotional response that could interfere with test-takers’ ability to give their best effort.
Attention to “sensitivity” concerns, as they’re known, makes sense conceptually. But in practice, as they shape actual test development, sensitivity concerns are responsible for much of why conventional standardized tests are so ridiculously bland and unengaging. The drive to avoid potentially sensitive content constrains test developers to such a degree that one might legitimately question whether the cure is at least as bad as the disease.
So determined are test-makers to avoid triggering unwanted test-taker emotions, they end up compromising the validity of their tests by excluding essential educational content and restricting students’ opportunities to demonstrate the creative and critical thinking skills they’re actually capable of. In other words, ironically, conventional standardized tests may be so radically boring that they’re no better at measuring actual ability and achievement than if they regularly froze test-takers solid with depictions of graphic horror.
Actually, no one knows for certain if the tests are better or worse for being so cautious. There is no research defining sensitivity, no evidence-based catalog of topics to avoid, no study measuring the test-taking effects of “sensitive” content. For all anyone knows, inflaming emotions might actually improve test results — though few test-makers would risk experimenting to find out.
No test-maker wants to hear from a teacher or parent that a student was stunned, enraged, offended, or even mildly disconcerted by content they encountered on a test. And in fairness, no test-maker wants to subject a test-taking kid to a hurtful or upsetting experience. Test-takers are captives, after all; if something on the test makes them feel crappy, they have little choice but to sit there and absorb it. Their scores may or may not reflect the fact that their emotions were triggered: there’s really no way to tell.
On the other hand, high-stakes standardized tests, in and of themselves, trigger lots of negative emotions in plenty of kids, regardless of question content. So a cynic might wonder how much sensitivity concerns are driven by concern for kids’ experience, and how much by fear of the PR nightmares that would ensue from a question or passage that someone could claim was racially or religiously offensive. Whatever the case, the result is the same: keep it safe by keeping it bland.
Since there is no research to guide decisions on sensitivity, the rules test-makers set for themselves are based strictly on their own judgment, and on some sense of industry practice. Inevitably they default to the most conservative positions possible: if a topic might conceivably be construed as sensitive, that’s enough reason to keep it off the test.
Typically, sensitivity guidelines steer test developers away from content focused on age, disability, gender, race, ethnicity, or sexual orientation. Test-makers also avoid subjects they deem inherently combustible, such as drugs and drinking, death and disease, religion and the occult, sexuality, current politics, race relations, and violence.
A “bias review” process gets applied in the course of developing passages and questions for testing, to weed out anything that might be offensive or unfair to certain subgroups: typically African Americans, Asian Americans, Latinos, women, and sometimes Native Americans. The test-maker will send prospective test materials out for review by qualified educators who belong to these subgroups. If a reviewer thinks a test item is problematic, it gets tossed. Though this process is better than nothing, it reflects more butt-covering than enlightenment, putting test-maker and reviewer alike in the awkward position of saying, for instance, “These test items are not unfair to Black people. How do we know? We had a Black person look at them!”
Judgments on topics not pertaining to identity and cultural difference rest purely with the test-makers, who, as mentioned, are as risk-averse as can be. In one example I’m familiar with, a passage about the mythological Greek figure Eurydice was rejected because the story deals with death and the underworld. Think of all the literature and art excluded from testing by that kind of criterion. Think of the impoverished portrait of human achievement and lived experience conveyed to students by such an exclusion.
In another case, a passage on ants was rejected because it reported that males get booted out of the colony and die shortly after mating. I’m still not clear on whether the basis for that judgment centered on the reference to insects mating, insects dying, or the prospect of a student projecting insect gender relations onto human relations and being thereby too disturbed to think clearly. Whatever the case, rejecting such a passage on the basis of sensitivity concerns seems downright anti-science.
As does the elimination of references to hurricanes and floods because some kids might have experienced them. I remember a wonderful literary passage that depicted a kid watching his family’s possessions float around the basement when their neighborhood flooded. It was intended for high schoolers. It got the noose.
I’ve seen a pair of passages from Booker T. Washington and W. E. B. Du Bois nixed out of concern for racial sensitivity: you can’t have African Americans arguing with each other on questions of race. Test-makers strive to include people of color in their test content to satisfy requirements for cultural inclusivity. But those people of color cannot be engaged in the experience of being people of color, which renders the whole impulse toward inclusivity hollow and cynical. Such an over-abundance of caution does more to protect the test-maker than the student.
The content validity of educational assessments that cannot reference slavery, evolution, extreme weather events, natural life cycles, economic inequality, illness, and other such potentially sensitive topics should come under serious interrogation. More concerning still is the prospect of such tests driving curriculum. With school funding and teacher accountability riding on standardized test scores, teaching to the test makes irresistibly practical sense in many educational contexts. Thus, if the tests avoid great swaths of history, science, and literature, then so will curriculum.
The makers of the standardized tests schoolkids encounter argue that they are not interested in censoring educational content, only in recognizing that when students encounter potentially sensitive topics they need the presence of an adult to guide them through. The classroom and the dinner table are places for negotiating challenging subjects, not the testing environment, where kids are under pressure and on their own.
This rationale should rouse everyone to question why we continue to tolerate such artificial conditions for evaluating student learning. It essentially concedes that testing doesn’t align with curriculum, that kids will not be assessed on the things they’re taught — only on the things test-makers decide are safe enough to put in front of them. Further, it admits that test-makers compromise the content validity of their tests in deference to the highly contrived testing conditions they require. Surely we can recognize in this the severe design flaws that lie at the heart of the testing problem.
Obviously, insulting or traumatizing students with test content is something to be avoided. But at the same time, studies show that test-taker engagement is essential for eliciting the kinds of performances that accurately reflect students’ capabilities. When tasks lack relevance and authenticity they work against students’ ability to demonstrate their best work, especially students from underserved populations. Consider this statement:
Engagement is strongly related to student performance on assessment tasks, especially for students who have been typically less advantaged in school settings (e.g. English Language Learners, students of historically marginalized backgrounds) (Arbuthnot, 2011; Darling-Hammond et al., 2008; Walkington, 2013). In the traditional assessment paradigm, however, engagement has not been a goal of testing, and concerns about equity have focused on issues of bias and accessibility. A common tactic to avoid bias has been to create highly decontextualized items. Unfortunately, this has come at the cost of decreasing students’ opportunities to create meaning in the task as well as their motivation to cognitively invest in the task, thereby undermining students’ opportunities to adequately demonstrate their knowledge and skills.
In my own experience interviewing high schoolers about writing prompts, they want to write about Mexican rappers, violence in videogames, representations of gender and race in popular culture, football concussions, gun ownership, the double-standard dress codes schools impose on girls compared with boys, and other topics that are both authentic and relevant to them. Conventional standardized tests would not come near topics like these.
Any solution to this problem has to entail breaking away from the dominant, procrustean model of standardized test-taking, which isolates individual students from all resources and people, asks them to think and write on topics they may never have encountered before and care nothing about, and confines them to a timeframe that reflects the practical considerations of the test-maker, not the nature of authentic intellectual work.
Once assessment is freed from the absurdly contrived conditions of conventional test-taking, sensitivity concerns can be removed from the domain of test-makers worried about their own liability. Instead, along with their teachers and guardians, students can decide what topics are appropriate to grapple with in their academic work. In fact, learning to choose, scope, and frame a topic in ways appropriate for an academic project is itself an essential skill, worthy of teaching and assessing.
Arbuthnot, K. (2011). Filling in the blanks: Understanding standardized testing and the Black-White achievement gap. Charlotte, NC: Information Age Publishing.
Darling-Hammond, L., Barron, B., Pearson, P. D., Schoenfeld, A. H., Stage, E. K., Zimmerman, T. D., … & Tilson, J. L. (2015). Powerful learning: What we know about teaching for understanding. John Wiley & Sons.
Walkington, C. A. (2013). Using adaptive learning technologies to personalize instruction to student interests: The impact of relevant contexts on performance and learning outcomes. Journal of Educational Psychology, 105(4), 932.