Standardized testing is a communal rite of passage. Computer-adaptive testing is about to make those rites very individual.
Ditch that Number 2 pencil. Erase those little bitty bubbles right out of your memory. Don't fret about picking only one of three days in the year for that tension-filled 7:30 a.m. drive to the testing center. Change is afoot. High-tech innovations in computer-adaptive testing, or CAT, are forcing educators to reconsider the way standardized tests are administered and evaluated.
Using mathematical formulas that have been part of statistical theory for decades, CAT programs make snap judgments about how smart you are according to how you answer each question the program decides to give you. Imagine this: you're sitting down to take one of any number of standardized tests, be it a well-known exam such as the SAT or the GRE, or a more specialized test you might take as a firefighter trying to prove your mettle. CAT software presents an item of average or typical difficulty. Based on your response, the next item is either easier or harder. An incorrect response tells the program that the last item was too hard for you and that perhaps your true ability level is lower; a correct response tells it you're ready for something more challenging.
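In spirit, the adaptive loop is simple enough to sketch in a few lines of code. The toy model below is an illustration only - the starting point, step sizes, and test length are invented for this example and bear no relation to any testing company's proprietary algorithm:

```python
# A toy version of the adaptive loop: present an item near the current
# ability estimate, then move the estimate up or down based on the answer.
# All parameters here (starting point, step size, test length) are invented
# for illustration.

def administer_cat(answer_fn, n_items=10, start=0.0, step=1.0):
    """Run a toy adaptive test; answer_fn(difficulty) -> True if correct."""
    ability = start
    for _ in range(n_items):
        correct = answer_fn(ability)      # present an item at this difficulty
        ability += step if correct else -step
        step = max(step * 0.7, 0.1)       # later items move the estimate less
    return ability

# A hypothetical examinee whose true ability is 1.5: they get items below
# that level right and items above it wrong. The estimate homes in near 1.5.
print(administer_cat(lambda difficulty: difficulty < 1.5))
```

The shrinking step size mimics how a real CAT grows more confident in its estimate as the evidence accumulates.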
Now, the question of whether any timed battery of multiple-choice questions can predict anything useful about human intelligence is still open. If you're the sort who believes all standardized testing is a con job - some sort of pseudo-scientific shuck designed to unfairly exclude certain people from the colleges and careers they'd like to enter - then the current movement to computerize these tests won't impress you. But both your job and the education your children receive are largely determined by performance on standardized tests, so you can't escape the spectre of this kind of testing. Nor can you escape CAT.
So what, precisely, happens when you bid farewell to those bored proctors and toss your Number 2s? Once approved to take an exam, individual applicants are mailed a toll-free telephone number to schedule a testing appointment at their convenience. Say you're an aspiring nurse. On the chosen day, you arrive at one of several local testing centers, are assigned to a private terminal, and take a nursing exam. The data you input is transferred via modem from the center to the central hub of the computer outfit and from there to the mainframe of the testing company overseeing the examination. To ensure accuracy, supervisory organizations then confirm results by re-applying the test "key" before final verification of pass/fail status is sent electronically to the state nurses' licensing or certification board. Through CAT, all of this is possible in a few days; by contrast, traditional paper-and-pencil testing still requires a one- to two-month turnaround before you see your results.
Besides the obvious convenience and the fact that electronic delivery, storage, and instant scoring of computerized tests should eventually render them cheaper to administer, computer-adaptive tests have other, subtler advantages. For instance, computing capabilities make it possible, in the jargon of teachers and testers, to "self-adapt" exams. In the winter of 1992, the Journal of Educational Measurement published a study comparing results from self-adaptive and computer-adaptive tests. Using MicroCAT, testers had developed an item bank of math problems comprising six levels of difficulty. In the self-adaptive version of the quiz, students were given the chance to select their own level of difficulty on each question, rather than having the software choose it based on the previous answer. This option resulted in higher overall scores for the students taking the self-adaptive version. Proponents argue that such flexibility allows CAT to get closer to measuring skills and intelligence, rather than some abstraction of them.
In many cases, it's almost impossible to measure ability without using performance-based and simulated assessments rather than the old standby, multiple choice. One example of a simulation exam is the Clinical Competency Test in Veterinary Medicine, developed by the Professional Examination Service. Here, aspiring vets are given various scenarios in which they have to work through what they would do in each hypothetical situation. Examinees choose from a list of procedures, and each choice triggers its own consequences. Say the patient du jour is an ailing gerbil. The examinee is presented with symptoms that increase or decrease in severity according to what treatment is prescribed. Obviously, the objective would be to quickly initiate the correct treatment before the virtual gerbil croaks.
In simulation testing for public safety personnel (firefighters, police), the objective is similar: to present real-life crisis situations in a multidimensional manner to elicit responses that reflect how each job candidate would react "in the moment." Some of these multimedia exercises owe their basic structure to military prototypes that have finally trickled down from élite training programs. Simulations, of course, are something computers can expedite quite well. Some simulation tests allow responses in essay form, in which the computers only assist human raters in measuring the creativity and viability of a solution. But often, computers can directly speed up and streamline authentic assessments in ways that ultimately render unidimensional, multiple-choice models obsolete.
At its headquarters in Minnesota, Assessment Systems Corporation has been pioneering standardized testing software for 15 years. The company had to wait awhile for the size and cost of the hardware to come down to more consumer-friendly proportions, but now it is perfectly positioned to capitalize on the boom in microcomputer-assisted mental measurement. Its ads in trade magazines trumpet: "The dream is alive! Computerized testing is being implemented right now!" And indeed, what the company has come up with is the ultimate in do-it-yourself testing software, allowing a kindergarten teacher, a corporate president, or any ad hoc community group to develop customized, state-of-the-art, standardized exams. Just fill the question "bank" with suitable items - taking into consideration the sex, culture, and principal language of the people being tested - and off you go.
By demystifying the creation, calibration, and scoring of tests like these, Assessment Systems Corporation indirectly takes a stride toward liberating society from the unidimensional standard that has ruled the testing industry for decades. It has put the same tools used by rich, ivory-tower evaluation services within reach of average citizens. Free-market competition and entrepreneurial savvy can do the rest, eventually destroying the hegemony of over-simplified multiple-choice templates. Champions of testing reform, who've spent years telling people that sacred cows like the SAT are neither sacrosanct nor infallible, might now be taken more seriously.
But all the hype has not eliminated the raging controversy that has surrounded the testing world for years. The very seductiveness of all this added convenience obscures certain ethical drawbacks of using CAT to further popularize standardized testing, and it diverts attention from problems inherent in CAT itself. One debate now in high gear concerns biases in testing and the need for disclosure and evaluation of test questions.
A group making much noise in this discussion is FairTest, based in Cambridge, Massachusetts. The team works with educators, policy makers, and parents to advocate reform and public accountability in national testing protocol. FairTest's Public Education Director, Bob Schaeffer, is a sharp, affable, politicized zealot with a degree from MIT. Schaeffer is against "bad tests in pretty new packaging" - tests that have merely been dressed up with computer technology. To him, a toad in a golden cage is still a toad. "The first major use of mental measurement in this country was in World War I, when the Army Alpha Test was purportedly used to assign jobs to soldiers," Schaeffer explains ruefully. "Thousands of people were labeled morons or worse because they couldn't answer questions like: A puck is used in the following game: (a) tennis, (b) football, (c) hockey, or (d) golf. The questions were clearly only measuring how familiar a person was with American culture of that time, not how intelligent that person was." Schaeffer would rather the US not recreate the kind of immigration and job-placement policies that stemmed from the suspect findings of the Alpha Tests.
If Bob Schaeffer represents a more critical perspective on the motives and methods of standardized testing, John Katzman, president of The Princeton Review, one of the largest commercial test-coaching companies, takes a more pragmatic stand on the business both he and the test makers are in.
"We run courses for high-school and college kids preparing for tests for colleges and grad schools - mostly the SAT, LSAT, GMAT, GRE. We seat about 60,000 students a year," says Katzman, in the breezy, staccato style that reflects the rapport he feels with the ambitious youngsters he helps by devising coaching materials. "Our interest focuses on the openness and the fairness of tests. It's not a theoretical exercise for us. It's, What do we do with this?! So when the CAT for the GRE first came out in the early '90s, we realized that this is the way the world is going, and we wrote our own software to check it out."
Katzman testified last year before the New York Senate Higher Education Committee, presenting his ideas on how CAT methodology could be changed to satisfy both the testing companies' need for cost-effective content security and FairTest's need to check item banks for faulty and biased questions.
"I don't think that CATs are a bad idea," Katzman is careful to point out. "There are a lot of advantages. You can take them almost any time, you can get your score back immediately, those are all great things. The main problem is disclosure."
It's always been the position of Educational Testing Service, the US's largest maker of multiple-choice exams, that you can't reuse a test once you have disclosed it. So Katzman suggests that, particularly for the adaptive GREs and the inevitable SAT conversion, testing services expand the available item bank to include the thousands of previously disclosed and approved questions. If the computer is selecting items from a bank of 10,000 rather than from just 100 or so brand-new items, it becomes practically impossible for kids to memorize and share answers to any appreciable extent.
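Some back-of-the-envelope arithmetic shows why the size of the bank matters. If you assume - purely for illustration - that each examinee's test draws its items uniformly at random from the bank, the expected number of questions any two test-takers share falls in direct proportion to the bank's size:

```python
# Expected overlap between two k-item tests drawn at random from a bank of
# N items is k*k/N. The 30-item test length is an assumption for the example.
k = 30
for bank_size in (100, 10_000):
    shared = k * k / bank_size
    print(f"bank of {bank_size:>6} items: ~{shared:.2f} questions in common")

# bank of    100 items: ~9.00 questions in common
# bank of  10000 items: ~0.09 questions in common
```

With the small bank, roughly a third of one examinee's test shows up on the next examinee's; with the big bank, a shared question is a rarity.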
Further problems with computer-adaptive tests derive from the structure of the programming. Whereas traditional tests used 200 equally weighted, multiple-choice items, the first few "adaptive" questions on these shorter interactive models are intrinsically worth more toward a final grade than items that appear further on. In short, getting the first three or four questions right on a CAT usually bumps you up to a range of difficulty that places you on the "smart" end of the scale, from which it's hard to fall too far, no matter what you do. Getting the first few key items wrong, however, may drop you from "smart" (in the computer's estimation) to "average," and it's a big struggle to recover.
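A quick simulation makes the asymmetry concrete. Using the same sort of shrinking-step toy model sketched earlier (invented parameters again, not any vendor's scoring formula), two examinees who perform identically after the fourth question still land at opposite ends of the scale:

```python
import random

def final_estimate(answers, start=0.0, step=1.0):
    """Toy CAT scoring: each answer moves the estimate by a shrinking step."""
    ability = start
    for correct in answers:
        ability += step if correct else -step
        step = max(step * 0.7, 0.1)   # early answers swing the estimate most
    return ability

random.seed(1)

def average_score(first_four_correct, trials=1000):
    """Average final estimate over many tests with a fixed opening streak."""
    total = 0.0
    for _ in range(trials):
        rest = [random.random() < 0.5 for _ in range(6)]  # coin-flip finish
        total += final_estimate([first_four_correct] * 4 + rest)
    return total / trials

print(f"ace the first four:  {average_score(True):+.2f}")
print(f"miss the first four: {average_score(False):+.2f}")
```

On this toy scale the strong starter ends up around +2.5 and the weak starter around -2.5, even though their remaining answers are statistically identical.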
This is the quirk in CAT that bothers the coaching centers. They've paid big bucks to figure the angles that allow their students to beat the statistical odds, and CAT methodology is making that job a little tougher. Coaching centers can teach you to answer the questions on a traditional standardized test, but the game becomes significantly harder when computer-testing software branches and adapts, testing each student differently from a database of thousands of questions.
Though riddled with ethical and other troubles, standardized tests remain a ritual practice throughout the Western world. The Academy, and perhaps society in general, loves to mystify and fetishize the testing process. Ultimately, this may be the biggest obstacle to changing how we view and measure human achievement. The hundred million standardized tests given in the US each year are a rite of passage. As children, we learn to fear and worship the instruments of mental measurement, and a good part of our self-esteem centers on how the kind, quality, and quantity of our knowledge is judged. Addicted to the competitive egoism that results from ranking others as better or worse off than ourselves, we've learned to indulge that somewhat ugly urge in the sainted name of science.
Still, CAT's proponents insist that, despite its limitations, adaptive testing isn't only a method for speedier test tabulation. The hope CAT holds, its champions claim, is a future in which the testing process our society seems so drawn to will more accurately assess the breadth and depth of a student's knowledge. By allowing for differences, CAT more clearly reflects the reality of how - in a world of infinite facts - different people can have different knowledge sets and varying thinking structures, and still be accurately judged as competent.