Understanding Criterion-Referenced and Norm-Referenced Testing

Explore the differences between Criterion-Referenced (CR) and Norm-Referenced (NR) testing, their characteristics, and the importance of alignment in creating CR tests. Discover why not all language tests are proficiency tests and how testing experts emphasize the need for alignment in construct definition, test design, and development.




Presentation Transcript


  1. Criterion-Referenced Proficiency Testing. BILC, 26 May 2016. Ray Clifford.

  2. Norm-Referenced (NR) or Criterion-Referenced (CR): Does it make a difference? The classic distinction between CR and NR tests has focused on why the test is given: Norm-Referenced tests compare people with each other, while Criterion-Referenced tests compare each person against a criterion or standard. But there are other differences as well!

  3. Test Characteristics of Typical NR and CR Tests

  NR Tests:
  - Example: Chapter test in a textbook.
  - Testing Purpose: Compare a student's ability with the ability scores of other students.
  - Domains Covered: Curriculum-based content.
  - Type of Learning Tested: Direct application of rote learning and rehearsed material.
  - Scoring Procedures: Generate a single (total or average) test score.

  CR Tests:
  - Example: STANAG 6001 OPI.
  - Testing Purpose: Determine whether the test taker has specified real-world abilities.
  - Domains Covered: Curriculum-independent, real-world content.
  - Type of Learning Tested: Far-transfer learning that enables spontaneous and creative responses.
  - Scoring Procedures: Generate a floor and ceiling proficiency rating.

  4. Are all language tests proficiency tests? No. Achievement tests assess rote, direct-application learning. Performance tests assess rehearsed, near-transfer learning. Proficiency tests assess far-transfer learning. And not all tests that are called proficiency tests are criterion-referenced tests.

  5. Creating a CR Test. The construct must have Task, Context, and Accuracy (TCA) expectations. The following elements must be aligned: the construct to be tested, the test design and development, and the scoring model. If all of these elements are aligned, the test is legally defensible. Shrock, S. A., & Coscarelli, W. C. (2007). Criterion-referenced test development: Technical and legal guidelines for corporate training and certification (3rd ed.). San Francisco, CA: John Wiley and Sons.

  6. Creating a CR Language Test 1. Define incremental TCA stages of the trait (where each stage is more complex in dimensionality than the preceding stage). 2. Maintain strict alignment of the: Theoretical construct model. Test development model. Psychometric scoring model. Luecht, R. (2003). Multistage complexity in language proficiency assessment: A framework for aligning theoretical perspectives, test development, and psychometrics. Foreign Language Annals, 36, 527-535.

  7. Testing Experts Stress the Need for Alignment: Construct Definition → Test Design & Development → Test Scoring.

  8. Further Research With the approval of ATC, Permissive BAT research continued using English language learners interested in applying for admittance to a U.S. university. A diversity of first languages was represented among the test takers. The number who have taken the BAT Reading test now exceeds 600. With 600+ test takers, we have done the IRT analyses needed for adaptive testing.

  9. Validating the BAT. 1. WinSteps IRT analyses confirmed that the BAT test items did cluster by difficulty level and the clusters didn't overlap; the average logit values for each level were separated by more than 1 logit. 2. Clustered items were then assembled into testlets, and the 5-item testlets were assigned to the appropriate stage. For every level, the testlets were of comparable difficulty, within 0.02 logits.

  10. The BAT Reading Test Is a Criterion-Referenced Proficiency Test. Defines the construct with a hierarchy of level-by-level TCA requirements. Follows those level-by-level criteria in the test design and development process. Applies the same level-by-level criteria in its scoring process. Has been empirically validated: items clustered in difficulty at each level, and the clusters did not overlap.

  11. The BAT: A CR Reading Test. 1. WinSteps IRT analyses confirmed that the BAT test items did cluster by difficulty and the clusters didn't overlap. 2. When the clustered items were assembled into 5-item, level-specific testlets: for every level, the testlets were of comparable difficulty, within 0.02 logits; the standard error ranged from .04 to .06; and the average item logit values for each level were separated by more than 1 logit.

  12. WinSteps Results (with 4 testlets of 5 items each per level; n = 680)

  STANAG 6001 Level | Testlet A | Testlet B | Testlet C | Testlet D | SE of the model (logits)
  3                 |   +1.8    |   +1.8    |   +1.8    |   +1.8    | .04
  2                 |   +0.3    |   +0.3    |   +0.3    |   +0.2    | .04
  1                 |   -1.5    |   -1.6    |   -1.6    |   -1.7    | .06
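To make those two clustering criteria concrete, here is a minimal Python sketch that re-checks them against the (rounded) testlet logits from the table above. It is an illustrative recomputation only, not part of the actual WinSteps workflow:

```python
# Re-check the two clustering criteria using the testlet logits reported
# in the WinSteps results table above (values as rounded in that table).
from statistics import mean

testlet_logits = {
    1: [-1.5, -1.6, -1.6, -1.7],   # Level 1: Testlets A-D
    2: [+0.3, +0.3, +0.3, +0.2],   # Level 2: Testlets A-D
    3: [+1.8, +1.8, +1.8, +1.8],   # Level 3: Testlets A-D
}

# Criterion 1: within each level, testlet difficulties are comparable.
for level in sorted(testlet_logits):
    logits = testlet_logits[level]
    spread = max(logits) - min(logits)
    print(f"Level {level}: within-level spread = {spread:.1f} logits")

# Criterion 2: the average difficulties of adjacent levels are separated
# by more than 1 logit, i.e., the clusters do not overlap.
levels = sorted(testlet_logits)
for lo, hi in zip(levels, levels[1:]):
    gap = mean(testlet_logits[hi]) - mean(testlet_logits[lo])
    print(f"Level {lo} -> Level {hi}: gap = {gap:.2f} logits")
```

With these rounded values, the within-level spreads stay small (0.0 to 0.2 logits), while the gaps between adjacent level means come out to roughly 1.9 and 1.5 logits, comfortably above the 1-logit separation criterion.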

  13. Testlet Difficulty by Level (chart). Testlet difficulty within +/- .02 logits at each level; standard error < .06; vertical distance between clusters > 1 logit.

  14. And how did the national test results compare to the results of the validated BAT test?

  15. BAT and National Listening Scores. 59% exact agreement between BAT and national scores. But there was also disagreement: 34% disagreed by 1 level, and 7% disagreed by 2 levels. Of the disagreements, the national score was HIGHER in 81% of the cases.

  16. BAT and National Speaking Scores. 50% exact agreement between BAT and national scores. But there was also disagreement: 45% disagreed by 1 level, and 5% disagreed by 2 levels. Of the disagreements, the national score was HIGHER in 97% of the cases.

  17. BAT and National Reading Scores. 65% exact agreement between BAT and national scores. But there was also disagreement: 33% disagreed by 1 level, and 2% disagreed by 2 levels. Of the disagreements, the national score was HIGHER in 81% of the cases.

  18. BAT and National Writing Scores. 47% exact agreement between BAT and national scores. But there was also disagreement: 43% disagreed by 1 level, and 10% disagreed by 2 levels (1/3 and 2/4). Of the disagreements, the national score was HIGHER in 98% of the cases.

  19. Overall Observations. The national test results were within 1 level of the BAT results 90% to 98% of the time, but they matched the BAT results exactly only 47% to 65% of the time. The general trend was that the non-matching national ratings were higher. How can we improve these numbers?

  20. Was there a slip at the construct stage? Construct Definition [?] → Test Design & Development → Test Scoring.

  21. STANAG 6001 Construct Definition. Every base STANAG 6001 level description has 3 components: Tasks (communication functions), Context (content/topics), and Accuracy (precision expectations). At every base level, each of these components is different from the descriptions at the other levels. The levels are not a "single scale," but a hierarchy of Criterion-Referenced abilities.

  22. High-Level Language Learning Requires Far Transfer. A Proficiency Summary with General Text Characteristics and Learning Types (Green = Far Transfer, Blue = Near Transfer, Red = Direct Application):

  Level | Function/Tasks | Context/Topics | Accuracy
  5 | All expected of an educated NS | [Books] All subjects | Accepted as an educated NS
  4 | Tailor language, counsel, motivate, persuade, negotiate | [Chapters] Wide range of professional needs | Extensive, precise, and appropriate
  3 | Support opinions, hypothesize, explain, deal with unfamiliar topics | [Multiple pages] Practical, abstract, special interests | Errors never interfere with communication & rarely disturb
  2 | Narrate, describe, give directions | [Multiple paragraphs] Concrete, real-world, factual | Intelligible even if not used to dealing with non-NS
  1 | Q & A, create with the language | [Multiple sentences] Everyday survival | Intelligible with effort or practice
  0 | Memorized | [Words and Phrases] Random | Unintelligible

  23. The Construct to be Tested. Proficient reading: the active, automatic, far-transfer process of using one's internalized language and culture expectancy system to efficiently comprehend an authentic text for the purpose for which it was written. Author purposes: orient, inform, evaluate. Corresponding reading purposes: get necessary information, learn, evaluate and synthesize.

  24. Definition of Proficient Reading. Proficient reading is the active, automatic, far-transfer process of using one's internalized language and culture expectancy system to efficiently comprehend an authentic text for the purpose for which it was written.

  25. STANAG 6001 Reading Grid. For each level, the grid specifies the Conditions (author purpose, text type, and content), the Reader Task, and the Accuracy Expectations.

  Level 3:
  - Author Purpose: Evaluate situations, concepts, and conflicting ideas; present and support arguments and/or hypotheses with both factual and abstract reasoning.
  - Text Type: Multiple-paragraph prose on a variety of professional or abstract subjects, such as found in editorials, formal papers, and professional writing.
  - Content: Multiple, well-organized abstract concepts interlaced with attitudes and feelings. Social/cultural/political issues, with abstract aspects and supporting facts presented as well. Most allusions and references are explained by their context.
  - Reader Task: Understand literal and figurative meanings by reading both "the lines" and "between the lines." Recognize the author's tone and unstated positions. Evaluate the adequacy of arguments made.
  - Accuracy Expectations: Understands the facts, nuanced details, and the author's opinion, tone, and attitude.

  Level 2:
  - Author Purpose: Convey structured, factual information, supporting details, and factual relationships in extended narratives and descriptions.
  - Text Type: News reports, magazine articles, short stories, human-interest features, and instructional and descriptive materials.
  - Content: Concrete information about real-world phenomena with supporting details, as well as interrelated facts about world, local, and personal events.
  - Reader Task: Understand the facts and supporting details, including any causal, temporal, and spatial relationships.
  - Accuracy Expectations: Grasps both the main ideas and the supporting details.

  Level 1:
  - Author Purpose: Orient by communicating one or more general ideas.
  - Text Type: Very simple announcements; ads, personal notes.
  - Content: Information about places, times, people, etc. that are associated with everyday events, personal invitations, or general information.
  - Reader Task: Understand the main idea; orient oneself by identifying the topic or main idea.
  - Accuracy Expectations: Recognizes the main idea and some broad, categorical distinctions.

  Level 0:
  - Author Purpose: List, enumerate.
  - Text Type: Lists, simple tables.
  - Content: Sparse or random; format or external context may reveal internal relationships.
  - Reader Task: Recognize some random items in a list or short text.
  - Accuracy Expectations: Correctly recognizes some words.

  26. Did the national test construct adhere to the TCA alignment expected in the updated STANAG 6001 blueprint?

  27. Was there a slip at the Design and Development stage? Construct Definition → Test Design & Development [?] → Test Scoring.

  28. Did the test design and development process follow the level-by-level approach of the STANAG 6001 blueprint? (Diagram: a single item bank running from Easy Items up to Difficult Items is not the same as separate sets of Level 1, Level 2, and Level 3 items.)

  29. If the national tests followed a traditional design with a single bank of test items, that design would reduce or limit scoring accuracy, even if the difficulty range of the items were identical to the difficulty range of the 3-stage test.

  30. In fact, if a multi-level test is designed to report a single score, that test is no longer a CR test! A single score mixes criteria and includes test results across all of the levels tested. The test taker's total score then includes: their score at their sustained ability level, their score resulting from their partial control of the next-higher level, and their score resulting from conceptual control at the second-higher level.

  31. Was there a slip in the scoring process? Construct Definition → Test Design & Development → Test Scoring [?].

  32. Test Scoring Procedures: The Single Score Approach. Note: Relating a single, overall score to a multi-level set of criteria (such as the hierarchical set of STANAG 6001 criteria) presents formidable theoretical and practical challenges. That is why multiple statistical and quasi-statistical procedures have been developed to accomplish this cut-score-setting task.

  33. Test Scoring Procedures: The Single Score Approach. (Figure: groups of known ability at Levels 1, 2, and 3 are plotted against the test results to be calibrated on a 0-100 score scale.)

  34. The Results One Hopes For. (Figure: each known-ability group's test scores fall into its own distinct band on the 0-100 scale, with no overlap between groups.)

  35. The Results One Always Gets. (Figure: some test takers score below and some score above their known ability, so the groups' score bands overlap and the boundaries between them are uncertain.)

  36. No matter where the cut scores are set, they are wrong for some test takers. (Figure: because the known-ability groups' score bands overlap, any cut scores placed on the 0-100 scale misclassify some test takers.)
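Slides 33 through 36 depict what the standard-setting literature calls a contrasting-groups situation. The following Python sketch, using made-up score distributions rather than any actual calibration data, shows why overlapping group scores make every possible cut score wrong for someone:

```python
# Contrasting-groups illustration: setting a single-score cut between two
# groups of known ability. The score lists are hypothetical stand-ins,
# not actual BAT or national test data.
level1_scores = [22, 28, 31, 35, 40, 44]   # test takers known to be Level 1
level2_scores = [38, 42, 47, 52, 58, 63]   # test takers known to be Level 2

# One common heuristic: place the cut midway between the two group means.
mean1 = sum(level1_scores) / len(level1_scores)
mean2 = sum(level2_scores) / len(level2_scores)
cut = (mean1 + mean2) / 2

# Any Level 1 score at or above the cut, and any Level 2 score below it,
# is misclassified by the single-score rule.
misclassified = ([s for s in level1_scores if s >= cut]
                 + [s for s in level2_scores if s < cut])
print(f"Cut score: {cut:.1f}")                   # 38.3
print(f"Misclassified scores: {misclassified}")  # [40, 44, 38]

# Because the score ranges overlap (Level 1 reaches 44 while Level 2
# starts at 38), no placement of the cut classifies everyone correctly.
```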

  37. Why is a single score so difficult to interpret? Language learners do not completely master the communication tasks and topical domains of one proficiency level before they begin learning the skills described at the next higher level. Usually, learners will develop conceptual control or even partial control over the next higher proficiency level by the time they have attained sustained, consistent control over the lower level.

  38. If STANAG 6001 Levels Were Buckets. (Figure: three buckets labeled 1, 2, and 3; the blue arrows indicate the water, i.e., the ability, observed at each level.) Notes: The buckets may begin filling at the same time. Some Level 2 ability will develop before Level 1 is mastered. That is OK, because the buckets will still reach their full (mastery) state sequentially.

  39. Is there a better way than indirect extrapolation from a single score to assign proficiency levels? Would close adherence to the proficiency scale levels and CR scoring improve testing accuracy?

  40. A mini experiment: the BAT was scored two ways. The floor and ceiling criterion-referenced ratings were more defensible than the total-score results. The criterion-referenced rating process ranked 37% of the test takers differently than they were ranked by their total-score results. Let's see why that happens.

  41. Results for Two Students on a multi-level reading test. There were 60 total points possible: 20 points for Level 1, 20 points for Level 2, and 20 points for Level 3. Let's compare their total scores and their level-by-level, "floor and ceiling," criterion-based scores.

  42. Example A: Alice's total score = 35; CR proficiency level = 2 (Level 2, with random abilities at Level 3). (Chart, with a verbal axis running from "None" through "Some" and "Most" to "Almost all": Level 1: 17 points, 85%; Level 2: 14 points, 70%; Level 3: 4 points, 20%.)

  43. Example B: Bob's total score = 37; CR proficiency level = 1+ (Level 1, with developing abilities at Level 2). (Chart, same verbal axis: Level 1: 17 points, 85%; Level 2: 11 points, 55%; Level 3: 9 points, 45%.)

  44. A Comparison of Results. Single-score results: Alice, 35 total points; Bob, 37 total points. CR results: Alice, Level 2; Bob, Level 1+. Based on their total scores, where would you set the cut-score between 1+ and 2? Would both be given a 2? Or both a 1+?

  45. A Comparison of Results. Single-score results: Alice, 35 total points; Bob, 37 total points. CR results: Alice, Level 2; Bob, Level 1+. Based on their total scores, where would you set the cut-score between 1+ and 2? Would both be given a 2? Or both a 1+? CR scoring solves that problem!

  46. To Apply CR Scoring to STANAG 6001 Tests: Calculate a separate score for each level. With a separate score for each level, that level-specific score is not influenced by developing abilities and guessing at other levels. The test taker's proficiency level is then determined by a comparison of her/his level-specific scores, not by the total score.
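A minimal Python sketch of this level-by-level, floor-and-ceiling logic follows. The 70% "sustained mastery" and 50% "partial control" thresholds are illustrative assumptions chosen to reproduce the Alice and Bob examples above; the presentation itself does not specify numeric cut-offs:

```python
# Floor-and-ceiling CR scoring from separate per-level scores.
# The MASTERY and PARTIAL_CONTROL thresholds are assumptions for
# illustration; they are not specified in the presentation.
MASTERY = 0.70          # hypothetical sustained-ability threshold
PARTIAL_CONTROL = 0.50  # hypothetical "plus level" threshold

def cr_rating(level_scores: dict[int, float]) -> str:
    """Rate a test taker from level-specific proportions correct.

    The floor is the highest level such that it and every level below
    it meet the mastery threshold; a '+' is appended when the next
    level up shows partial (but not sustained) control.
    """
    floor = 0
    for level in sorted(level_scores):
        if level == floor + 1 and level_scores[level] >= MASTERY:
            floor = level
        else:
            break
    next_up = level_scores.get(floor + 1, 0.0)
    return f"{floor}+" if next_up >= PARTIAL_CONTROL else str(floor)

# The two students from the examples above (proportion correct per level):
alice = {1: 0.85, 2: 0.70, 3: 0.20}   # 35 of 60 total points
bob   = {1: 0.85, 2: 0.55, 3: 0.45}   # 37 of 60 total points

print("Alice:", cr_rating(alice))  # -> 2
print("Bob:  ", cr_rating(bob))    # -> 1+
```

Note how the comparison is non-compensatory: Bob's higher total (37 vs. 35) cannot offset his weaker Level 2 performance, which is exactly the distinction the next slide draws.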

  47. Advantages of CR Scoring. CR scoring is non-compensatory scoring (whereas a single overall score on a multi-level test is always a compensatory score). CR test designs combined with CR scoring yield a score for each level tested. Floor and ceiling scores can explain ability distinctions that would be regarded as error variance in multi-level tests that report only a single total test score.

  48. Are you ready to take your test scoring to the next level? 1. If the difficulty indices of your test items group those items into level-specific clusters, and 2. If those level-specific clusters array themselves in an ascending hierarchy of difficulty groupings, you are ready!

  49. If they align, everything's fine! Construct Definition → Test Design & Development → Test Scoring.

  50. Questions?
