THE ASSESSMENT DEBATE

by Jeffrey M. Jones, M.D., Ph.D.

Nearly all parents have encountered standardized tests many times during their lives. They may have taken the Iowa Tests of Basic Skills as grade schoolers and similar tests in middle school or high school. They may have taken the SAT or ACT when applying for admission to college, or the MCAT, LSAT, or Graduate Record Examination when applying to graduate or professional school. They may have taken standardized tests for professional licensure or certification or for other career advancement purposes. We often do not remember these tests fondly. Somehow it didn't seem right to have our efforts as students reduced to a four- to eight-hour test in which we had to choose correct answers and record them with our number 2 pencils in the bubbles of an answer sheet prepared for machine grading. Some of us may say, "I just don't test well on standardized tests." Before accepting a statement such as this at face value, we need to ask, "Is there truly a better alternative to standardized testing?"

When we have taken standardized tests, it has been for a high-stakes purpose (promotion, placement, certification, etc.). Whether or not we realized it, our test results and the results of other students were often being used to determine the effectiveness of the educational programs in which we had participated. And whether or not we like to admit it, those standardized tests made us uncomfortable because we knew each item had only one correct answer. We either knew the necessary facts or concepts and had the skills needed to answer each test item correctly, or we did not. These were not essay examinations where we could write down a page or two of prose in the hope of stumbling on an acceptable answer. An essay test is an example of a performance assessment. Would performance assessments have been preferable to those sometimes frustrating standardized tests? Before answering this question, we need to ask ourselves another: "If I had taken essay tests rather than standardized tests for high-stakes purposes, would I be complaining that the test preparers happened to choose the four subjects about which I didn't know much, or that the grading of these tests was subjective and unreliable?"

Over the years, psychometricians (professional test preparers) have learned a great deal about how to create standardized tests that truly measure the knowledge and skills they claim to measure. In other words, these tests are valid. These standardized tests are also reliable. This implies, for example, that the same person taking a different form of the test will tend to earn close to the same score (please see Definitions for more information about reliability and validity). Validity and reliability are the key components of testing fairness. Any test is a sampling device. The main reason that a good objective, standardized test is fairer than a good performance test is that, for a test of reasonable length, a good standardized test can probe a much larger and more representative sample of the subject (domain) being tested. If the domain is a skill, the objective test will be able to sample a larger and more representative variety of the factors that make up the skill. A performance test of writing, for example, may cover only one or two genres (kinds) of writing (e.g., narrative or persuasive), whereas an objective test can sample five or six. If a content (knowledge) domain is being tested, then a good standardized test can sample a bigger portion of the whole domain than can a performance test that takes the same amount of time to complete. For example, a performance test in science could consume an hour testing the ability of a student to build electric circuits. In the same period of time, a multiple-choice test could sample student knowledge in this area as well as knowledge about atomic structure, electrons, magnetism, capacitance, resistance, voltage, power, and so on.
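
For readers who want to see what "reliability" means in practice, here is a minimal computational sketch in Python. Everything in it is an illustrative assumption of ours, not data from any study cited here: the simulated student abilities, the sizes of the measurement errors, and the 0.40 single-reading reliability figure are all made up for the example. It estimates reliability as the correlation between scores on two parallel forms of a test, then uses the standard Spearman-Brown formula to show how sampling more of a domain (more items, topics, or raters) raises that reliability.

  # A sketch of score reliability on synthetic data: each observed score is a
  # student's true ability plus independent measurement error.
  # Requires Python 3.10+ for statistics.correlation.
  import random
  import statistics

  random.seed(1)

  abilities = [random.gauss(50, 10) for _ in range(500)]  # true skill levels
  form_a = [a + random.gauss(0, 10) for a in abilities]   # one form of a test
  form_b = [a + random.gauss(0, 10) for a in abilities]   # a parallel form

  # Parallel-forms reliability: the correlation between the two forms' scores.
  print("reliability estimate:", round(statistics.correlation(form_a, form_b), 2))

  def spearman_brown(r_single, k):
      """Reliability of a score based on k independent samplings of the domain
      (the standard Spearman-Brown formula)."""
      return k * r_single / (1 + (k - 1) * r_single)

  # If a single essay reading had a reliability of 0.40 (a made-up figure),
  # combining more independent topics and raters would raise the reliability
  # of the combined score:
  for k in (1, 5, 25):
      print(k, "independent readings ->", round(spearman_brown(0.40, k), 2))

On these assumptions, a single reading is quite unreliable, and it takes on the order of twenty-five independent readings to push reliability above 0.90, the same order of magnitude as the five-topics-by-five-raters requirement discussed later in this article.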

Before discussing the use of different assessment formats for high-stakes purposes, we should pause to distinguish between the use of assessments as a teaching tool and assessments given for a high-stakes purpose. Teachers certainly may use multiple-choice, short-answer, true-or-false, or matching tests in everyday classroom practice. Probably the most common assessment teachers use to determine how well students are mastering material is the oral communication assessment: simply put, the teacher asks a student or a group of students to answer a question aloud and uses the response to gauge how well the material is understood. Students may also be asked to give more extended oral presentations on particular subjects. And the teacher may have students complete essays, posters, term papers, and the like to enhance their learning. These performance assessments have been used in the classroom for years.

Their use in the classroom differs markedly from their use for a high-stakes purpose in several important ways. First, the classroom teacher realizes that students can have "bad days." In other words, the teacher realizes that performance assessments may have poor score reliability. Second, a number of performances can be done over a period of several months, and the teacher will not count these as the only component in the student's grade. The performance assessments are used to supplement more objective tests, which determine in a more detailed way what the student has learned. In other words, the classroom teacher can take into account the fact that performance tests may lack generalizability. Third, the student generally gets almost immediate feedback about how he or she did on a particular performance assessment and can be given detailed suggestions for improvement, concretely tied to the teacher's evaluation of his or her work. This cannot occur when a performance assessment is given for high-stakes purposes.

Thus, we do not want to give the impression that we object to the use of performance assessments in the classroom. We think that their use should flow naturally from the content and skills to be taught, and that they should be part of a balanced assessment system in the classroom. But there is a real danger that an excessive emphasis on performance assessments will lead to the classroom activities associated with this testing format becoming the only way in which classroom teaching occurs. If this happens, the quantity and quality of the content and skills taught are both likely to suffer.

It is important to understand that the reasons for giving high-stakes assessments, and the circumstances in which they are given, are fundamentally different from the reasons that tests are given in the classroom. First, it is certainly true that grades on classroom assessments determine a student's grade for a particular subject; viewed this way, classroom assessments can be high-stakes tests. But classroom testing clearly has a very important instructional goal. The high-stakes assessments given for college admission, professional certification, and the like, on the other hand, differ from classroom assessments because they have no real instructional purpose. They are given solely to gauge a student's achievement.
A student may do very poorly on a high-stakes assessment, and this may lead the student to make plans for further study before retaking the examination; however, the overriding purpose of the high-stakes test is to compare his achievement to an expected standard or norm. Second, the testing environment for a high-stakes assessment is different from that of the classroom. A teacher can monitor a student's achievement by testing many times over a period of months. This is not the situation for high-stakes assessments. A student has only a limited time to complete a high-stakes assessment, which is typically given over a period of two hours to two days. Thus, it becomes critical that the high-stakes test instrument give reliable scores and fairly sample what a student knows. The student gets no "second chance" to show what he knows.

Earlier, we said that standardized tests have been developed to ensure that they are reliable and valid when used for high-stakes purposes. Even though it may not be obvious to us, a good deal of research has gone into the development of many of these standardized tests. Let's take a quick trip through the history of the development of high-stakes standardized tests, using the assessment of writing ability as an example.

In the early 1900s, the English Composition Test given for college entry by the College Board consisted of an extended essay. A researcher named Hopkins showed that the score a student made on the examination might well depend more on which year he appeared for the examination, or on which person read his paper, than on what he had written. This certainly did not seem fair at all, and it was this observation that led to research on what kind of essay test would give reliable results and, ultimately, to the development of standardized, objective test items that could be used to assess writing skill.

Researchers looked at high school English grades and at systematic ratings of student writing ability by teachers who knew the students well. They then tested students with essay examinations and looked at how well the scores obtained correlated with these classroom assessments of writing ability. They also had a number of skilled raters read writing samples from high school students to determine how well different raters could be brought to agree in their scoring of student work. From these data, the researchers determined the breadth and number of writing samples required to give reliable scores of writing quality. For example, the studies of Diederich and others (1961) and Anderson (1960) indicated that to get a highly reliable score of writing ability (a score reliability of about 0.90, meaning the same students would earn nearly the same scores on different forms of the test), it would be necessary to have students complete essays on at least five carefully chosen topics and have each writing sample scored by at least five skilled raters. Researchers reasoned that such a performance assessment of writing would require very long testing periods, would be very expensive to score, and would pose difficult quality-control problems as one tried to assess thousands of students.

Therefore, in the 1960s, Godshalk and others decided to build on these observations and determine whether objective, standardized questions could be prepared that would predict how well a student would score on essay tests similar to those described by Diederich and Anderson. They discovered multiple-choice test items that yielded scores quite comparable to those obtained by having students complete an essay examination requiring several hours. In fact, they concluded that a multiple-choice test that took 20 minutes to complete gave the same information as an essay examination requiring two full and three half classroom periods (240 minutes of testing time). The multiple-choice test could be machine scored in less than a second, whereas the collection of essays written by each student took the raters over two hours to score. Adding a 20-minute essay to the multiple-choice test did add somewhat to the ability to predict writing skill; however, grading the essay substantially increased the cost of the test. Any way you look at their data, the standardized test, with or without a short essay, was far more economical of student time and scoring cost than an extended essay performance assessment. Yet the standardized test was still a very reliable and valid predictor of student performance. It is worth looking at the three types of test items that the Godshalk study showed to be highly predictive of a student's writing ability:

I. Usage items present a sentence in which an element (grammar, diction, basic
    structure, or mechanics) is to be corrected or, if correct, left standing.

Example:
  He spoke bluntly and angrily to we spectators.
	(A) bluntly
	(B) angrily
	(C) we
	(D) spectators
	(E) no error		Correct answer: (C)


II. Sentence correction items require selection of the best form for an
    underlined portion of a sentence. The sentence is to be corrected,
    improved, or, if acceptable as written, let stand.

Example:
  While waving good-by to our friends, the airplane took off, and we
  watched it disappear in the sky.
	(A) While waving
	(B) Upon waving
	(C) Having waved
	(D) Waving
	(E) While we waved		Correct answer: (E)


III. Construction shift items require the student to decide what
      additional changes to make in a sentence when a specified element
      is changed in a certain way.
	
Example:
  Statements such as "this picture is trash" or "the outlook is dark"
  or "this steak is wonderful" are statements not only about the picture,
  the outlook, or the steak but also about the speaker's reaction to them.
 
   Substitute "give less information" for "are statements not only." Your
   rewritten sentence will contain which of the following?
	(A) but about
	(B) as about
	(C) than about
	(D) than the
	(E) and more about		Correct answer: (C)

At first glance, these test items are not very appealing. One wonders what such questions have to do with writing skill. They seem to sample editing skill rather than writing ability. An essay, it would seem, ought really to test the ability of a student to get words down on paper. But it turns out that this creative, constructive side of writing is not well sampled by an essay (performance) test, and the quality of writing is much more dependent on editing ability than on facility in simply getting words down on paper. Very good writers edit in their minds before they write. Excellence in writing and excellence in editing are inseparable. Thus, the standardized test items end up assessing writing ability quite well, use testing time efficiently, and are economical to score (answers are graded by machine). Having looked briefly into the history of the development of standardized tests to assess writing skill, one might wonder whether a demand to replace a standardized test with a prolonged essay performance assessment isn't a regression to the 1920s through the 1940s, when less reliable essay tests were the only ones available.
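
The heart of the Godshalk finding, that a cheap objective test can stand in for an expensive essay battery, can be pictured in miniature. The sketch below is ours, not the study's: it uses entirely synthetic data, and the error sizes are illustrative assumptions. It simulates students whose multiple-choice and essay scores both reflect the same underlying writing skill, and checks how well the short test predicts the long one.

  # A toy version of a Godshalk-style comparison, on synthetic data: do scores
  # from a short objective test track scores from a long essay battery?
  # Requires Python 3.10+ for statistics.correlation.
  import random
  import statistics

  random.seed(2)

  skill = [random.gauss(0, 1) for _ in range(500)]           # underlying writing skill
  mc_test = [s + random.gauss(0, 0.5) for s in skill]        # 20-minute objective test
  essay_battery = [s + random.gauss(0, 0.5) for s in skill]  # multi-session essay battery

  # To the extent both instruments measure the same skill, the correlation is
  # high (about 0.8 on these assumptions), and the cheap, machine-scored test
  # predicts the expensive, hand-scored one.
  print("correlation:", round(statistics.correlation(mc_test, essay_battery), 2))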

There are, of course, a variety of criticisms that have been leveled against standardized tests. Some of these can be true, but we believe they are objections that can be overcome. Standardized multiple-choice tests can be superficial; but when one looks at the high-stakes tests developed for certifying medical professionals, it is clear that multiple-choice examinations can test highly complex knowledge bases and their application. Teachers can also teach to the standardized test; however, this criticism can in truth be directed against any testing scheme, and teaching to the test can be remedied in a straightforward way. Multiple forms of a standardized test can be prepared (most quality commercial testing companies have several forms of each standardized test available), and a district can wait until the last possible moment to choose which form is given. It would be difficult, if not impossible, to outsmart such a system, since teaching the right answers to all the many forms of the test would be more or less the same as teaching the whole domain. In this situation, a teacher might as well teach without worrying about which questions might appear on the test. The best practice for teachers would be simply to teach their subject as effectively as they can.

There are organized groups of educationists, such as the Cambridge-based group FairTest, who would like to throw out standardized tests completely, replacing them with performance assessments. Failing that, they would like much more of the testing time to be spent on performance assessments. Of course, if more time is spent on performance assessments and much less on standardized testing, and the total testing time remains constant, fewer objective test items can be used, and the score reliability of the standardized portion of such a hybrid test will suffer. The end result would be a standardized test with poor reliability given alongside a performance test that also had poor reliability. We would get a testing system that had lost its ability to provide reliable after-the-fact accountability. If test results are unreliable, the achievement of groups of students in successive years, or in different schools or school districts, could not be measured or compared. This would be a very convenient situation for the educationists who operate the public school system, which amounts to a virtual monopoly in the K-12 years. They could always claim that students are doing well, and there would be no way to test the truth of their assertions.

Why do we see standardized tests under constant attack today? We think there are two reasons. First, the attacks are a scapegoating exercise that attempts to shoot the messenger who brings bad news. If machine-scored standardized tests were bringing less discouraging news, or if educationists could produce workable ideas about how to make the results better, they would not be so intent on attacking standardized tests. If our children's standardized test scores were much higher, or if scores were more evenly distributed by class, race, and gender, or if American students compared more favorably in international rankings, the constant assault on standardized testing would vanish. Second, there is an assessment industry, which has grown large and powerful over the past 30 years. The much greater expense associated with the creation, administration, and scoring of performance assessments is appealing to this industry. Like any other special interest, it will lobby to obtain funding from the deepest pockets around, namely, the state and federal governments.



References

Anderson, C.C. "The new STEP essay test as a measure of composition ability." Educational and Psychological Measurement, Spring 1960, pp. 95-102.

Diederich, P.B., French, J.W., and Carlton, S.T. "Factors in judgments of writing ability." Research Bulletin RB-61-15. Princeton, N.J.: Educational Testing Service, 1961, 60 pp.

Godshalk, F.I., Swineford, F., Coffman, W., and the Educational Testing Service. The Measurement of Writing Ability. New York: College Entrance Examination Board, 1966, 84 pp.

Hopkins, L.T. "The marking system of the College Entrance Examination Board." Harvard Monographs in Education, Series 1, No. 2. Cambridge, Mass.: Graduate School of Education, Harvard University, October 1921, 15 pp.


P.R.E.S.S., P.O. Box 26913, Milwaukee, WI 53226
Phone: (414) 453-8116, FAX: (414) 453-9442, E-mail: presswis@execpc.com
http://www.execpc.com/~presswis/

