Measuring What Counts: A Conceptual Guide for Mathematics Assessment (1993)


Suggested Citation:"6 Evaluating Mathematics Assessment." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.

6
EVALUATING MATHEMATICS ASSESSMENTS

Whether a mathematics assessment comprises a system of examinations or only a single task, it should be evaluated against the educational principles of content, learning, and equity. At first glance, these educational principles may seem to be at odds with traditional technical and practical principles that have been used to evaluate the merits of tests and other assessments. In recent years, however, the measurement community has been moving toward a view of assessment that is not antithetical to the positions espoused in this volume. Rather than view the principles of content, learning, and equity as a radical break from past psychometric tradition, it is more accurate to view them as gradually evolving from earlier ideas.

Issues of how to evaluate educational assessments have often been discussed under the heading of "validity theory." Validity has been characterized as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." 1 In other words, an assessment is not valid in and of itself; its validity depends on how it is interpreted and used. Validity is a judgment based on evidence from the assessment and on some rationale for making decisions using that evidence.

Validity is the keystone in the evaluation of an assessment. Unfortunately, it has sometimes been swept aside by other technical matters, such as reliability and objectivity. Often it has been thought of in narrow terms ("Does this assessment rank students in the same way as another one that people consider accurate?"). Today, validity is being reconceived more broadly and given greater emphasis in discussions of assessment. 2 Under this broader conception,


validity theory can provide much of the technical machinery for determining whether the educational principles are met by a mathematics assessment. One can create a rough correspondence between the content principle and content validity, 3 between the learning principle and consequential or systemic validity, 4 and between the equity principle and criteria of fairness and accessibility that have been addressed by Silver and Lane. 5

Although every mathematics assessment should meet the three principles of content, learning, and equity, that alone cannot guarantee a high-quality assessment. Technical considerations, including generalizability, evidence, and costs, still have a place. The educational principles are primary and essential but they are not sufficient.

THE CONTENT PRINCIPLE


What is the mathematical content of the assessment?

What mathematical processes are involved in responding?

Applying the content principle to a mathematics assessment means judging how well it reflects the mathematics that is most important for students to learn. The judgments are similar to early notions of content validity that were limited to asking about the representativeness and relevance of test content. The difference lies in a greater concern today for the quality of the mathematics reflected in the assessment tasks and in the responses to them.

Procedures for evaluating the appropriateness of assessment content are well developed and widely used. Most rely heavily on expert judgment. Judges are asked how well the design of the assessment as a whole captures the content to be measured and how well the individual tasks reflect the design. The two sets of judgments determine whether the tasks sufficiently represent the intended content.

New issues arise when the content principle is applied:


CONTENT OF TASKS

Because mathematics has been stereotyped as cut and dried, some assessment designers have assumed that creating high-quality mathematics tasks is simple and straightforward. That assumption is false. Because mathematics relies on precise reasoning, errors easily creep into the words, figures, and symbols in which assessment tasks are expressed.

Open-ended tasks can be especially difficult to design and administer because there are so many ways in which they can misrepresent what students know and can do with mathematics. 6 Students may give a minimal response that is correct but that fails to show the depth of their mathematical knowledge. They may be confused about what constitutes an adequate answer, or they may simply be reluctant to produce more than a single answer when multiple answers are called for. In an internal assessment constructed by a teacher, the administration and scoring can be adapted to take account of misunderstanding and confusion. In an external assessment, such adjustments are more difficult to make. The contexts in which assessment tasks are administered and the interpretations students make of them are critical in judging the significance of the content.

The Ironing Board

The diagram shows the side of an ironing board.

The two legs cross at x°.

  1. Use the information in the diagram to calculate the angle x°. Give your answer to the nearest degree.
  2. Calculate the value of l.

Difficulties arise when attempts are made to put mathematics into realistic settings. The setting may be so unfamiliar that students cannot see mathematics in it. Or, the designer of the task may have strained too hard to make the mathematics applicable, ending up with an artificial reality, as in the example above. 7 As a practical matter, the angle between


the legs of the ironing board is not nearly so important as the height of the board. As Swan notes, 8 the mathematical content is not incorrect, but mathematics is being misused in this task. A task designer who wants to claim the situation is realistic should pose a genuine question: Where should the stops be put under the board so that it will be convenient for people of different heights?


The thinking processes students are expected to use in an assessment are as important as the content of the tasks. The process dimension of mathematics has not received sufficient attention in evaluations of traditional multiple-choice tests. The key issue is whether the assessment tasks actually call for students to use the kind of intellectual processes required to demonstrate mathematical power: reasoning, problem solving, communicating, making connections, and so on. This kind of judgment becomes especially important as interesting tasks are developed that may have the veneer of mathematics but can be completed without students' ever engaging in serious mathematical thinking.

Judging the adequacy of the thinking processes used in an assessment requires methods of analyzing tasks that reveal the steps contributing to successful performance. Researchers at the Learning Research and Development Center (LRDC) at the University of Pittsburgh and the Center for Research, Evaluation, Standards, and Student Testing (CRESST) at the University of California at Los Angeles are beginning to explore techniques for identifying the cognitive requirements of performance tasks and other kinds of open-ended assessments in hands-on science and in history. 9

Mixing Paint

To paint a bathroom, a painter needs 2 gallons of light blue paint mixed in a proportion of 4 parts white to 3 parts blue. From a previous job, she has 1 gallon of a darker blue paint mixed in the proportion of 1 part white to 2 parts blue. How many quarts of white paint and how many quarts of blue paint (1 gallon = 4 quarts) must the painter buy to be able to mix the old and the new paint together to achieve the desired shade? How much white paint must be added and how much blue paint?

Discuss in detail how to model this problem, and then use your model to solve it.

The analysis of task demands, however, is not sufficient. The question of what processes students actually use in tackling the tasks must also be addressed. For example, could a particular problem designed to assess proportional reasoning be solved satisfactorily by using less sophisticated operations and knowledge? A problem on mixing paint, described above, was written by a mathematics teacher to get at high-level understanding of proportions and to be approachable in a variety of ways. Does it measure what was intended?
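To see why the question matters, it helps to work the problem one way. The sketch below is only an illustration, and it adopts one reading that the task statement leaves open: that the old gallon and the newly purchased paint must together make exactly 2 gallons (8 quarts) of the 4:3 mixture. Under that reading, the answer falls out of fraction arithmetic with little explicit proportional reasoning.

```python
from fractions import Fraction

# Assumption (not stated in the task): old paint plus purchased paint
# must total exactly 2 gallons (8 quarts) of the desired 4:3 mixture.
QUARTS_PER_GALLON = 4
total_needed = Fraction(2 * QUARTS_PER_GALLON)   # 8 quarts of light blue

# Desired mixture: 4 parts white to 3 parts blue.
white_needed = total_needed * Fraction(4, 7)      # 32/7 quarts
blue_needed = total_needed * Fraction(3, 7)       # 24/7 quarts

# Old paint: 1 gallon (4 quarts) mixed 1 part white to 2 parts blue.
old_total = Fraction(1 * QUARTS_PER_GALLON)
old_white = old_total * Fraction(1, 3)            # 4/3 quarts
old_blue = old_total * Fraction(2, 3)             # 8/3 quarts

# Amounts to add are just differences -- routine fraction arithmetic
# rather than an explicit proportional-reasoning strategy.
add_white = white_needed - old_white              # 68/21 quarts
add_blue = blue_needed - old_blue                 # 16/21 quarts

print(f"add {add_white} qt white (about {float(add_white):.2f}) "
      f"and {add_blue} qt blue (about {float(add_blue):.2f})")
```

On this reading the painter would add 68/21 (about 3.2) quarts of white and 16/21 (about 0.8) quart of blue; whether producing that answer counts as evidence of proportional reasoning is exactly the judgment reviewers must make.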


Such questions can be answered by having experts in mathematics education and in cognitive science review tasks and evaluate student responses to provide information about the cognitive processes used. (In the mixing paint example, there are solutions to the problem that involve computation with complicated fractions more than proportional reasoning, so that a student who finds a solution has not necessarily used the cognitive processes that were intended by the task developer.) Students' responses to the task, including what they say when they think aloud as they work, can suggest what those processes might be. Students can be given part of a task to work on, and their reactions can be used to construct a picture of their thinking on the task. Students also can be interviewed after an assessment to detect what they were thinking as they worked on it. Their written work and videotapes of their activity can be used to prompt their recollections.

None of these approaches alone can convey a complete picture of the student's internal processes, but together they can help clarify the extent to which an assessment taps the kinds of mathematical thinking that designers have targeted with various tasks. Researchers are beginning to examine the structure of complex performance assessments in mathematics, but few studies have appeared so far in which labor-intensive tasks such as projects and investigations are used. Researchers at LRDC, CRESST, and elsewhere are working to develop guidelines for gauging whether appropriate cognitive skills are being engaged by an assessment task.

Innovative assessment tasks are often assumed to make greater cognitive demands on students than traditional test items do. Because possibilities for responses to alternative assessment tasks may be broader than those of traditional items, developers must work harder to specify the type of response they want to evoke from the task. For example, the QUASAR project has developed a scheme for classifying tasks that involves four dimensions: (1) cognitive processes (such as understanding and representing problems, discerning mathematical relationships, organizing information, justifying procedures, etc.); (2) mathematical content (which is in the form of categories that span the curriculum); (3) mode of representation (words, tables, graphs, symbols, etc.); and (4) task content (realistic or nonrealistic). By classifying tasks along four dimensions, the QUASAR researchers can capture much of the richness and complexity of high-level mathematical performance.
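As an informal illustration of how such a classification might be recorded (the field names and the sample entry here are hypothetical and are not drawn from QUASAR materials), each task could carry a tag for every dimension:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskClassification:
    """Hypothetical record for one task, tagged along the four
    dimensions described in the text."""
    cognitive_processes: List[str]    # e.g., representing problems, justifying procedures
    mathematical_content: str         # a curriculum-spanning content category
    representation_modes: List[str]   # words, tables, graphs, symbols, ...
    task_context: str                 # the fourth dimension: "realistic" or "nonrealistic"

# An illustrative (not official) tagging of the bus-fare task shown below.
bus_fare_task = TaskClassification(
    cognitive_processes=["organizing information", "justifying procedures"],
    mathematical_content="number and operation",
    representation_modes=["words", "tables"],
    task_context="realistic",
)
```

Profiles of this kind make it easier to check whether a collection of tasks, taken together, covers the processes and representations the developers intend.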


The QUASAR project has also developed the QUASAR Cognitive Assessment Instrument (QCAI) 10 to gather information about the program itself rather than about individual students. The QCAI is a paper-and-pencil instrument for large-group administration to individual students. At each school site, several dozen tasks might be administered, but each student might receive only 8 or 9 of them. A sample task developed for use with sixth-grade students appears below. 11

Sample QUASAR Task

The table shows the cost for different bus fares.

BUSY BUS COMPANY FARES

Weekly Pass $9.00

Yvonne is trying to decide whether she should buy a weekly bus pass. On Monday, Wednesday and Friday she rides the bus to and from work. On Tuesday and Thursday she rides the bus to work, but gets a ride home with her friends.

Should Yvonne buy a weekly bus pass?

Explain your answer.

The open-ended tasks used in the QCAI are in various formats. Some ask students to justify their answers; others ask students to show how they found their answers or to describe data presented to them. The tasks are tried out with samples of students and the responses are analyzed. Tasks are given internal and external reviews. 12

Internal reviews are iterative, so that tasks can be reviewed and modified before and after they are tried out. Tasks are reviewed to see whether the mathematics assessed is important, the wording is clear and concise, and various sources of bias are absent. Data from pilot administrations, as well as interviews with students thinking aloud or explaining their responses, contribute to the internal review. Multiple variants of a task are pilot tested as a further means of making the task statement clear and unbiased.

External reviews consist of examinations of the tasks by mathematics educators, psychometricians, and cognitive psychologists. They look at the content and processes measured, clarity and precision of language in the task and the directions, and fairness. They also look at how well the assessment as a whole represents the domain of mathematics.

The scoring rubrics are both analytic and holistic. A general scoring rubric (similar to that used in the California Assessment Program) was developed that reflected the scheme used for classifying tasks. Criteria for each of the three interrelated components of


the scheme were developed at each of the five score levels from 0 to 4. A specific rubric is developed for each task, using the general scoring rubric for guidance. The process of developing the specific rubric is also iterative, with students' responses and the reactions of reviewers guiding its refinement.

Each year, before the QCAI is administered for program assessment, teachers are sent sample tasks, sample scored responses, and criteria for assigning scores that they use in discussing the assessment with their students. This helps ensure an equitable distribution of task familiarity across sites and gives students access to the performance criteria they need for an adequate demonstration of their knowledge and understanding.

CURRICULAR RELEVANCE

The mathematics in an assessment may be of high quality, but it may not be taught in school or it may touch on only a minor part of the curriculum. For some purposes that may be acceptable. An external assessment might be designed to see how students approach a novel piece of mathematics. A teacher might design an assessment to diagnose students' misconceptions about a single concept. In such cases, questions of relevance may be easy to answer.


Other purposes, however, may call for an assessment to sample the entire breadth of a mathematics curriculum, whether of a course or a student's school career. Such purposes require an evaluation of how adequately the assessment treats the depth and range of curriculum content at which it was aimed. Is each important aspect of content given the same weight in the assessment that it receives in the curriculum? Is the full extent of the curriculum content reflected in the assessment?

The term alignment is often used to characterize the congruence that must exist between an assessment and the curriculum. Alignment should be looked at over time and across instruments. Although a single assessment may not be well aligned with the curriculum because it is too narrowly focused, it may be part of a more comprehensive collection of assessments.

The question of alignment is complicated by the multidimensional nature of the curriculum. There is the curriculum as it exists


in official documents, sometimes termed the intended curriculum; there is the curriculum as it is developed in the classroom by teachers through instruction, sometimes termed the implemented curriculum; and there is the curriculum as it is experienced by students, sometimes termed the achieved curriculum. Depending on the purpose of the assessment, one of these dimensions may be more important than the others in determining alignment.

Consider, for example, a curriculum domain consisting of a long list of specific, self-contained mathematical facts and skills. Consider, in addition, an assessment made up of five complex open-ended mathematics problems to which students provide multi-page answers. Each problem might be scored by a quasi-holistic rubric on each of four themes emphasized in the NCTM Standards: reasoning, problem solving, connections, and communication. The assessment might be linked to an assessment framework that focused primarily on those four themes.


An evaluator interested in the intended curriculum might examine whether and with what frequency students actually use the specific content and skills from the curriculum framework list in responding to the five problems. This examination would no doubt require a reanalysis of the students' responses because the needed information would not appear in the scoring. The assessment and the intended curriculum would appear to be fundamentally misaligned. An evaluator interested in the implemented curriculum, however, might be content with the four themes. To determine alignment, the evaluator might examine how well those themes had been reflected in the instruction and compare the emphasis they received in instruction with the students' scores.

The counting and matching procedures commonly used for checking alignment work best when both domains consist of lists or simple matrices and when the match of the lists or arrays can be counted as the proportion of items in common. Curriculum frameworks that reflect important mathematics content and skills (e.g., the NCTM Standards or the California Mathematics Framework) do not fit this list or matrix mode. Better methods are needed to judge the alignment of new assessments with new characterizations of curriculum.
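For list-style frameworks, the counting-and-matching procedure just described amounts to a simple overlap calculation. The sketch below, with hypothetical topic names, is only an illustration of that approach, not an established instrument; it computes the proportion of curriculum items that also appear in the assessment:

```python
def list_alignment(curriculum_items, assessment_items):
    """Proportion of curriculum items also covered by the assessment.

    A minimal sketch of the counting-and-matching approach described
    above; it presumes both domains are flat lists of comparable items.
    """
    curriculum = set(curriculum_items)
    if not curriculum:
        raise ValueError("curriculum list is empty")
    return len(curriculum & set(assessment_items)) / len(curriculum)

# Hypothetical example: three of the four listed topics appear in the assessment.
print(list_alignment(
    ["fractions", "decimals", "percent", "ratio"],
    ["fractions", "percent", "ratio", "area"],
))  # 0.75
```

When a framework is organized instead around intertwined themes such as reasoning, problem solving, connections, and communication, there is no comparable list to intersect, which is why such simple indices break down.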
