Essential Factors in Assessment Methods
In this workshop presentation, participants will delve into the critical components of validity, reliability, and fairness in assessment design. They will learn simple statistical methods for verifying scoring reliability, use checklists to assess the fairness of assessment tasks, and identify threats to assessment validity. Participants will also explore data collection methods, interpretation techniques, scoring reliability checks, and the impact of assessment results on instructional strategies and learner outcomes.
Presentation Transcript
Properties Being Considered for Any Assessment Method. Gavin T. L. Brown (gt.brown@auckland.ac.nz). Workshop presentation at the Second Annual University Teaching Forum, King Saud University, Riyadh, February 11, 2013.
Presentation Objectives. Participants will: develop an understanding of the principles of validity, reliability, and fairness as they apply to the design of multiple assessment tasks; learn to conduct simple statistical procedures for verifying reliability of scoring; utilize checklists for determining the fairness of assessment tasks; and utilize a validity-chain checklist for determining threats to the validity of assessment processes.
Gathering Information: How do we collect data? Methods: observe behaviour; observe performance on set task(s); oral question & answer; paper & pencil questions; test or examination; one-off collections; on-demand; prepared. There are many methods, and all have strengths & weaknesses. Do: Make a list of all assessment methods used in your teaching. Compare with others in your faculty. How many different techniques?
Interpretation: How do I interpret the data? Making sense of information: compared to others (norm reference); compared to content (criterion reference); compared to standards (levels of performance); compared to items right & wrong (architecture of performance); compared to previous performance (self reference). Validity of interpretation is vital. Think & discuss: What reference system do you use in your Department/Faculty for interpreting performance?
Reliability: How much can I trust the information? Reliability of SCORING (not of tests): consistency between judges/teachers; consistency within a judge/teacher across time, task, and student; consistency between methods (e.g., test vs. observation); consistency within a method. Precision & accuracy in measurement? Think & discuss: What methods do you use to check that scoring is trustworthy in your department?
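The objectives mention simple statistical procedures for verifying the reliability of scoring. As one hedged illustration, the Python sketch below computes percent agreement and Cohen's kappa for two markers grading the same scripts; the marker names and grades are hypothetical, not data from the workshop.

```python
# A minimal sketch of one simple check on scoring reliability: percent
# agreement and Cohen's kappa between two markers. Grades are hypothetical.
from collections import Counter

def agreement_and_kappa(rater_a, rater_b):
    """Return (percent agreement, Cohen's kappa) for two lists of categorical marks."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal distribution
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum((count_a[c] / n) * (count_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical grades awarded by two markers to the same ten scripts
marker_1 = ["A", "B", "B", "C", "A", "B", "C", "C", "A", "B"]
marker_2 = ["A", "B", "C", "C", "A", "B", "C", "B", "A", "B"]
print(agreement_and_kappa(marker_1, marker_2))  # (0.8, approx. 0.70)
```

High agreement but low kappa would suggest the markers agree mostly by chance (e.g., because almost every script gets the same grade), which is exactly the kind of check the slide is asking departments to make.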
Consequences: What happens with the results? Purposes for assessment: change teaching (content, method, timing); change the learner's situation (groups, classes, programme); certify learning (report & reward, or withhold qualification); certify the instructor/institution; reject or ignore. Validity of actions/responses is critical. Think & discuss: What does your department do with assessment results?
Activity: Who wants to be a ...? 3 volunteers, come on down! Hairdresser? Some questions to determine your suitability: Name three types of hair. What is the best brand of shampoo to use? Who is an important hairdresser who regularly appears on TV ads? BONUS: What is the most important tool in a hairdresser's kit?
What went wrong? What threats were there to a valid interpretation that someone with a high score knows more about hairdressing? Discuss with your neighbours. Report to the class.
Validity Defined. Appropriateness of the inferences, uses, & consequences that result from assessment. The soundness, trustworthiness, or legitimacy of the claims or inferences made on the basis of obtained scores. Degree of soundness in the consequences of the inferences & decisions. Not a characteristic of a test, but a judgement. McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction (2nd ed.). Boston, MA: Allyn & Bacon. (p. 59)
Validity Defined. "An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). Old Tappan, NJ: Macmillan. What kind of evidence is needed to judge that the inferences and decisions are appropriate?
Validity Chain: the chain as metaphor for assessment. The links run from the Domain through Item/Task Construction, Assessment Design, Administration, Scoring Performance (Reliability), Score Aggregation, Generalisability, Merit Evaluation, and Action Evaluation (Consequences). All aspects are linked: weakness at any one point calls into question all inferences & decisions. No one link is more important than any other. The links identify the key aspects that must be evaluated for validation evidence. Source: Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to the valid use of assessments. Assessment in Education, 3(3), 265-285.
Threats to Quality of Assessment. Relationship of items to content: do items fit the content of interest? Design of assessment: does the assessment cover important things, in the desired balance? Are the distribution and type of items suited to the content of interest? Item construction quality: are the items well crafted? Is there a range of cognitive processes required? A range of response formats?
A detour: writing good multiple-choice items. Use them where the task calls for a single, clear answer to a question. When well designed, they emphasize critical thinking and reasoning rather than factual recall. Use them when the range of possible correct answers to an open question would be too broad to focus thinking, and to remove the load of writing, not of thinking.
Reasons to use M-C items: easy to sample widely across the domain of interest; highly manageable; raise mean achievement (fewer missing responses); students like them (Elley & Mangubhai, 1992).
Disadvantages of M-C items: hard to write quality items; believed to test surface-level processing, usually because of poor construction; guessing factor; may require a good reading level; recognition of the answer & test-wiseness.
Anatomy of a multiple-choice item. Stem: What is the best purpose of a test item? a. To measure student learning (the key) b. To keep the principal informed c. To produce a mark in the markbook d. To show how little has been learned (b-d are distractors).
Writing M-C items: rules for stems. Keep clear & concise (specific is terrific!); not too long to read. Avoid negatively worded questions; emphasise NOT if you must ask a negative question. Check the answer is not elsewhere in the paper. Avoid clues in grammar (a/an, is/are, etc.). Use interrogatives ("What is the name of this tool?") or imperatives ("State the advantages of the ...") rather than sentence completion.
Multiple-Choice Questions with Negative Stems: construct irrelevance. Which of the following is not true of George Washington? a. He served only two terms as president. b. He was an experienced military officer before the Revolutionary War. c. He was born in 1732. d. He was one of the signers of the Declaration of Independence. Sources of CIV difficulty for all test takers: 1. Fundamentally focuses attention on non-important knowledge. 2. Confusing, improbable, and unnecessary negative wording leads to errors. 3. Difficult for non-native language speakers. 4. CIV difficulty is compounded with negatively worded options. Foster, D. (2010, July). Common sources of construct irrelevant variance and what can be done about them. Paper presented at the biannual conference of the International Test Commission, Hong Kong.
Writing M-C items: rules for answers. Only one correct answer, the key. The answer is actually correct: check, check, re-check! The answer is sufficient to answer the question. No pattern of correct answers. Do not repeat words from the stem. Use typical errors students make.
Multiple-choice distractors (wrong answers). Plausible, not silly or plainly wrong: connected to a commonly held misunderstanding, or an overgeneralisation or a narrowing of application. Similar to each other and to the answer: similar length, similar style, matching the grammar or style of the stem or question; they should attract guessers & those with imperfect or weak knowledge. Arrange options in a logical order (alphabetical, numerical, time series). Avoid implausible qualifiers (e.g., never, always). Avoid "all of the above" and "none of the above".
Generating distractors. Use common misunderstandings. Which is the value of 2+3? (Could use the following distractors:) a. -1 (2-3) b. 5 (correct) c. 6 (2x3) d. 8 (2^3) e. 23 (writing the digits together as a single numeral). Use misinterpretations of text. Or present the question to students as an open-ended item (their actual errors can then supply the distractors).
Number of response alternatives. Typically three, four, or five. Four or five are favoured over three: fewer options mean more guessing (expected chance score of 1/3 vs. 1/4 vs. 1/5 per item), and more options increase discrimination. Four is most common, but three is defensible; use three if entirely appropriate (e.g., an acre is larger than / smaller than / equal to a hectare?).
Example M-C Question. Which year is associated with the early European exploration of New Zealand? a) 1215 b) 1492 c) 1642 d) 1852. Reasons for options: a) signing of Magna Carta, a significant date in British history that may be known by students; has a 2 in it. b) year European settlement of the United States began; has a 2 in it. c) year Abel Tasman made the first recorded sighting of New Zealand; has a 2 in it. d) year the New Zealand Parliament was constituted; has a 2 in it.
A modern alternative: MCQ with multiple answers. Which 3 statements about George Washington are correct? (Choose 3.) a. He served only two terms as president. b. He was an experienced military officer before the Revolutionary War. c. He was born in 1732. d. He was one of the signers of the Declaration of Independence. Significant reduction of CIV difficulty: 1. Focuses the student on relevant knowledge. 2. Removes confusion. 3. Easier to translate or adapt to other cultures/languages. 4. The unfamiliar format may need a bit of practice. Foster, D. (2010, July). Common sources of construct irrelevant variance and what can be done about them. Paper presented at the biannual conference of the International Test Commission, Hong Kong.
A modern alternative: Discrete Option Multiple Choice (DOMC). Options are presented one at a time, and the question ends when it is answered correctly or incorrectly. Stem: Pavlov's original research on digestion headed in a direction that he had not originally intended when he noticed that his dogs ... Options presented one at a time: failed to salivate at all; began salivating just before being fed (the key); did not salivate after being fed; salivated at random times; did not salivate until they tasted the food. Foster, D. (2010, July). Common sources of construct irrelevant variance and what can be done about them. Paper presented at the biannual conference of the International Test Commission, Hong Kong.
Test of Objective Evidence. Each of the questions in the following set has a logical or best answer from its corresponding multiple-choice answer set. "Best answer" means the answer has the highest probability of being the correct one in accordance with the information at your disposal. There is no particular clue in the spelling of the words and there are no hidden meanings. Please record your eight answers. You have 60 seconds per pair of questions. Then discuss with a partner and determine the reason for each answer.
Questions 1-2. 1. The purpose of the cluss infurmpaling is to remove: a. cluss-prags b. tremails c. cloughs d. pluomots ( )
2. Trassig is true when: a. clump trasses the von b. the viskal flans, if the viskal is donwil or zortil c. the belgo fruls d. dissels lisk easily ( )
Questions 3-4. 3. The sigia frequently overfesks the trelsum because: a. all sigias are mellious b. all sigias are always votial c. the trelsum is usually tarious d. no trelsa are feskable ( )
4. The fribbled breg will minter best with an: a. derst b. morst c. sortar d. ignu ( )
Questions 5-6. 5. The reasons for tristal doss are: a. the sabs foped and the doths tinzed b. the dredges roted with the crets c. few rakobs were accepted in sluth d. most of the polats were thonced ( )
6. Which of the following is/are always present when trossels are being gruven? a. rint and yost b. Yost c. shum and Yost d. yost and plone ( )
Questions 7-8. 7. The mintering function of the ignu is most effectively carried out in connection with: a. arazmatoi b. the groshing stantol c. the fribbled breg d. a frailly sush ( )
8. __________________________ a. b. c. d. ( )
Broken M-C Rules. 1(a) Repeats a key word from the stem; first option. 2(b) Longest option. 3(c) Breaks the syntactic pattern; the only singular option. 4(d) Grammatical cue: "an" requires a vowel. 5(a) Grammatical cue: the only option with plural verbs. 6(b) The only word constant in all options. 7(c) Answer given elsewhere in the paper (Qn. 4). 8(d) Follows the pattern a, b, c, d, a, b, c, d.
STOPPING BY WOODS ON A SNOWY EVENING (Robert Frost). Activity: Specify a learning objective(s). Create an MCQ based on this text for that objective without worrying about the quality; just practice. Write & then compare. Can you see ways the item can be improved?
Whose woods these are I think I know. / His house is in the village though; / He will not see me stopping here / To watch his woods fill up with snow. / My little horse must think it queer / To stop without a farmhouse near / Between the woods and frozen lake / The darkest evening of the year. / He gives his harness bells a shake / To ask if there is some mistake. / The only other sound's a sweep / Of easy wind and downy flake. / The woods are lovely, dark and deep, / But I have promises to keep, / And miles to go before I sleep, / And miles to go before I sleep.
Threats to Quality of Assessment. Administration of assessment: do the circumstances of administration bring out the best? E.g., low motivation, anxiety, inappropriate conditions, clarity of communication. Scoring performance on items: are the correct or model answers actually correct? Do model answers cover all the important qualities that you expected? Is there a mechanism to handle the oddball answer? Is there evidence that scoring was done consistently? What is the balance of analytic and holistic scoring? Are there any other ways of interpreting scores?
Threats to Quality of Assessment. Aggregation of scores into a scale: scores aggregated into scale scores make more sense than a score for every item. Are the items sufficiently similar in terms of content or style to be legitimately aggregated? What are the appropriate weights for the elements being brought together into one score? (Illustration from the slide: Test A: 1, 2, 3; Test B: 4, 5, 6; Test C: 7, 8, 9.)
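As a hedged illustration of the weighting question raised above, the sketch below combines component scores into a single composite under explicit weights; the component scores and weights are hypothetical choices, not values from the slides.

```python
# A minimal sketch of aggregating component scores into one composite,
# assuming hypothetical components and weights (not values from the slides).
def weighted_composite(scores, weights):
    """Combine percentage scores using explicit weights that sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical course with Tests A, B, C each marked out of 100
component_scores = [72.0, 58.0, 90.0]
equal_weights    = [1/3, 1/3, 1/3]
chosen_weights   = [0.2, 0.3, 0.5]   # a deliberate weighting decision

print(weighted_composite(component_scores, equal_weights))   # approx. 73.3
print(weighted_composite(component_scores, chosen_weights))  # 76.8
```

The point of the sketch is simply that the composite, and hence any merit decision built on it, changes with the weights, so the weighting should be an explicit and defensible decision.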
Threats to Quality of Assessment. Generalisation from items to domain. NB: all assessments are SAMPLES of a domain. Is the information sufficient & representative enough to dependably generalise to the domain we're interested in? What are the limits of the information's meaning? What does knowing the answers to these items tell us about a person's knowledge of the chapter and ability to function in the domain? (Diagram: a TEST samples items from Topics A, B, and C within a chapter/domain.) PS: What is the implication if all the marks came from one section?
Evaluation of Merit. Worked example of items correct by content area: Content A, 2/5 (a weakness); Content B, 4/5 (a strength); Content C, 3/5; the sub-totals give a total of 9/15 = 55%. Competing merit decisions for the same result: 55% is a fail; 55% is a C-; 9/15 is good; 9/15 is poor; 9/15 is uninterpretable. What reasons are given for the merit decision, and are you persuaded by the evidence and reasoning? Ways of setting a cut score: consider expected standards; consider a minimum-competency judgement; consider the average score of a norm population; adjust for the reliability of the test; adjust for the difficulty of items; adjust for the quality of items; don't just accept convention/tradition.
Threats to Quality of Assessment. Evaluation of merit of performance: what is the basis for evaluation? % correct, standing in the group, criteria descriptions? NZ Scholarship rules: exam responses must be classified as outstanding or scholarship; rank position must be top 3% for Scholarship and top 0.35% for Outstanding. How accurate are the scores? Are merit decisions biased against or in favour of students? Are merit decisions reliable & consistent across all populations?
Setting Standards. Faculty in HE determine the merit of student performance and express it, normally converted into grades: e.g., 95-99% = A+; 88-94% = A; 83-87% = A-; etc. Standards usually have meanings. A = Highly Competent: high to exceptionally high quality; excellent knowledge and understanding of subject matter and appreciation of issues; well formulated arguments based on strong and sustained evidence; relevant literature referenced; high level of creative ability, originality and critical thinking; excellent communication and presentation skills; and so on for each grade.
What's a standard? The communal judgement of a group of peers: peer review of publications; peer review of graduate theses; peer review of essays, assignments, course-work. People who use our grades (employers, parents, students) MUST have confidence that we are competent to judge and that we judge competently.
Standards referencing: a natural & normal process of classification. Degrees of quality, frequently 3 levels: GOOD, OKAY, BAD (like Goldilocks); everything else is a variant: very good/excellent; good/merit; satisfactory/okay; poor/fair; inadequate. Or Advanced, Proficient, Basic, Below Basic. Could be 2 levels: Pass/Fail. We perceive these qualities in most things, not just academic work.
So what's the problem? Tasks or tests will always vary in difficulty. A percentage score on a very hard task might not lead to the correct interpretation about the standard achieved: 75% might be an A if it was a hard test; 75% might be a C if it was an easy test. So, since we are basing our grading on standards, not marks, how can we adjust for a tough test?
Standard Error of Measurement (SEM). The standard error of measurement concerns the consistency of within-person scores: Observed = True + error (O = T + e). How accurate is an observed score? How much error is there around the observed score? What would the person really get if we could remove the error component? The SEM shows the range around the observed score in which the true score falls, like an opinion poll value plus or minus a margin of error. The size of the band indicates the degree of confidence that the band around an observed score contains the true score: +/- 1 SEM covers the true score about 68% of the time; +/- 2 SEM about 95%; +/- 3 SEM about 99.7%.
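A small illustrative sketch of the SEM idea: using the standard formula SEM = SD x sqrt(1 - reliability), it computes the error band around an observed score. The reliability, standard deviation, and score values are hypothetical, not figures from the workshop.

```python
# A minimal sketch: compute SEM and a confidence band around an observed score.
# The reliability, SD, and observed score below are hypothetical examples.
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(observed, sem_value, z=1.0):
    """Band around the observed score likely to contain the true score.
    z=1 gives roughly 68% confidence, z=2 roughly 95%."""
    return observed - z * sem_value, observed + z * sem_value

test_sd, test_reliability = 10.0, 0.75    # hypothetical test statistics
error = sem(test_sd, test_reliability)    # 10 * sqrt(0.25) = 5.0
print(error)                              # 5.0
print(score_band(50, error, z=1))         # (45.0, 55.0): 68% band
print(score_band(50, error, z=2))         # (40.0, 60.0): ~95% band
```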
Exercise: SEM with a test/exam. Imagine 3 students got these % marks for a test that has SEM = 5: Person 1 = 50; Person 2 = 47; Person 3 = 55. Work out, with 68% confidence, a defensible rank order of the students. Decide which students, if any, passed or failed the test, with reasons! What checks would you want to carry out, if any, before finalising the marks?
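One possible working of the exercise (not a model answer from the slides), assuming a +/- 1 SEM band for 68% confidence and an illustrative pass mark of 50%:

```python
# Possible working for the exercise, reusing the SEM band idea sketched above.
# Assumes a +/- 1 SEM (68%) band; the pass mark of 50 is a hypothetical choice.
marks, sem_value, pass_mark = {"Person 1": 50, "Person 2": 47, "Person 3": 55}, 5, 50

for name, mark in marks.items():
    low, high = mark - sem_value, mark + sem_value
    verdict = "band straddles the pass mark" if low < pass_mark < high else (
        "clearly at/above pass" if low >= pass_mark else "clearly below pass")
    print(f"{name}: {mark}% -> 68% band {low}-{high} ({verdict})")

# Bands: Person 1: 45-55, Person 2: 42-52, Person 3: 50-60.
# All three bands overlap, so no rank order is defensible at 68% confidence,
# and pass/fail decisions near a 50% cut score are not clear-cut for anyone.
```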
Implications of SEM. It shows the range of accuracy: a high value indicates low accuracy; a low value indicates high accuracy. High-stakes inferences and decisions should not be made without consulting the SEM. The SEM could be used to moderate the cut score for minimum competency if you don't have time to make judgements.
One approach to Standard Setting (modified Angoff). Think about the students for whom the task is designed and the level of performance expected of those people. Example: Stage 2, introductory course, beginners. An A means highly competent/excellent for this level of learner. What an A beginner can do is not the same as what an A advanced practitioner can do; a student test is not the same as the professor's test. The level of expectation matters.
One approach to Standard Setting. Likewise, imagine the minimally competent beginner (who gets a C-): what %, grade, or score would such a person get on this task? Repeat for the good student and the excellent student. Compare your scores with another tutor/lecturer. Set cut scores for grades, and award grades based on merit evaluation against standards, not on scores derived from a test or task of unknown difficulty.
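A minimal sketch of how the judge-based estimates described on these two slides could be combined into cut scores: each judge estimates the percentage a borderline student at each grade would earn, and the cut score is the average of those estimates. The judges' numbers and grade labels are hypothetical, not from the workshop.

```python
# A minimal sketch of a modified-Angoff-style cut-score calculation.
# Each list holds one hypothetical estimate per judge for the borderline student.
judge_estimates = {
    "C- (minimally competent)": [45, 50, 48],
    "B (good)":                 [65, 68, 70],
    "A (excellent)":            [82, 85, 80],
}

# Cut score for each grade = average of the judges' estimates
cut_scores = {grade: sum(est) / len(est) for grade, est in judge_estimates.items()}
for grade, cut in cut_scores.items():
    print(f"{grade}: cut score approx. {cut:.1f}%")

def award_grade(score, cuts=cut_scores):
    """Award the highest grade whose cut score the observed score reaches."""
    for grade in ["A (excellent)", "B (good)", "C- (minimally competent)"]:
        if score >= cuts[grade]:
            return grade
    return "below C-"

print(award_grade(72))   # "B (good)" with these hypothetical cut scores
```

Comparing and averaging estimates across judges, as the slide suggests, is what makes the resulting cut scores a communal standard rather than one marker's opinion about a test of unknown difficulty.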