Assessment Quality and Item Difficulty in Educational Testing


Explore the concepts of assessment quality, item difficulty, and item discrimination in educational testing. Learn how to evaluate the difficulty of test items and interpret discrimination indices to improve educational assessments.

  • Assessment
  • Quality
  • Difficulty
  • Discrimination
  • Educational


Presentation Transcript


  1. ASSESSMENT QUALITY

  2. ITEM QUALITY
     • Item difficulty
     • Item discrimination
     • Item scoring

  3. ITEM DIFFICULTY (1)
     Difficulty index for dichotomous (0/1) items:
     Difficulty (p) = (number of students responding correctly to the item) / (total number of students responding to the item)
     • Higher values of p indicate item easiness: p = .80 means 80% of students answered the item correctly.
     • Lower values of p indicate item difficulty: p = .20 means only 20% of students answered the item correctly.
     • On a four-option multiple-choice test, a p value of about .25 would be expected by chance alone (due to guessing).
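The calculation can be sketched in a few lines of Python (not part of the original slides); the function name and the 0/1 response list are illustrative assumptions.

```python
# Minimal sketch: the difficulty index p for a dichotomous (0/1) item,
# given one 0/1 response per student.

def difficulty_p(responses):
    """p = number of correct (1) responses / total number of responses."""
    if not responses:
        raise ValueError("No responses supplied.")
    return sum(responses) / len(responses)

# Example: 8 of 10 students answered correctly, so p = .80 (an easy item).
print(difficulty_p([1, 1, 1, 0, 1, 1, 1, 0, 1, 1]))  # 0.8
```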

  4. ITEM DIFFICULTY (2)
     Difficulty index for rubric-scored (polytomous) items:
     Difficulty = (sum of student scores on the item for all students) / (total number of students responding to the item)
     i.e. the average student score for the item.
     • Higher values indicate item easiness: on a 4-point rubric-scored item, a difficulty of 3.5 means most students achieved a high score.
     • Lower values indicate item difficulty: on a 4-point rubric-scored item, a difficulty of 1.5 means most students scored low.
     • Dividing by the total number of points possible makes these values comparable to p-values for dichotomous items.
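A minimal sketch of the same calculation for rubric-scored items (not from the slides); the function name and the example scores are assumptions.

```python
# Minimal sketch: difficulty of a rubric-scored (polytomous) item as the
# average student score, optionally rescaled to a p-like 0-1 value.

def polytomous_difficulty(scores, max_points=None):
    """Average score on the item; divide by max_points to compare with p-values."""
    avg = sum(scores) / len(scores)
    return avg / max_points if max_points else avg

scores = [4, 3, 4, 3, 4, 2, 4, 4]          # scores on a 4-point rubric item
print(polytomous_difficulty(scores))        # 3.5 -> most students scored high
print(polytomous_difficulty(scores, 4))     # 0.875 -> comparable to a p-value
```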

  5. WHAT ITEM DIFFICULTY TELLS US I evaluated the difficulty of the items in my assessment by finding the percentage of students who passed each item. I found that most of my items had a p value (or difficulty index) between .2 and .8, which means that they were neither too easy nor too difficult. Additionally, the items that were lower on the learning progression had higher values, meaning that more students passed the easier items, and the items that were higher on the learning progression had lower values, meaning that fewer students passed the more difficult items. This shows me that the sequence in my learning progression was fairly accurate in reflecting increasing levels of sophistication as you move up the learning progression.

  6. ITEM DISCRIMINATION (1)
     Discrimination index: the relationship between students' total test scores and their performance on a particular item.
     Type of item               Proportion of correct responses on total test
     Positive discriminator     Higher scorers > low scorers
     Negative discriminator     Higher scorers < low scorers
     Nondiscriminator           Higher scorers = low scorers

  7. ITEM DISCRIMINATION (2) Computing an item's discrimination:
     1. Order the test papers from high to low by total score.
     2. Choose roughly the top 25% and the bottom 25% of these papers (e.g. if you have 25 students, you will choose about the top and bottom 6 students).
     3. Calculate a p-value (item difficulty; see previous slides) for each of the high and low groups.
     4. Subtract p(low) from p(high) to obtain each item's discrimination index (D): D = p(high) - p(low).
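A minimal Python sketch of the four-step procedure above; the data layout (parallel lists of total scores and 0/1 responses to one item) and the function name are assumptions for illustration.

```python
# Minimal sketch of the four-step procedure: sort by total score, take roughly
# the top and bottom 25%, compute p for each group, and subtract.

def discrimination_index(total_scores, item_responses, fraction=0.25):
    """D = p(high group) - p(low group), using roughly the top/bottom 25%."""
    paired = sorted(zip(total_scores, item_responses), key=lambda x: x[0], reverse=True)
    n = max(1, round(len(paired) * fraction))      # e.g. 25 students -> about 6 per group
    high = [resp for _, resp in paired[:n]]
    low = [resp for _, resp in paired[-n:]]
    return sum(high) / len(high) - sum(low) / len(low)

# Hypothetical data: total scores and responses to one item for 8 students.
totals = [38, 35, 33, 30, 27, 22, 18, 15]
item   = [ 1,  1,  1,  1,  0,  1,  0,  0]
print(discrimination_index(totals, item))   # 1.0 with top/bottom 2 students
```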

  8. ITEM DISCRIMINATION (3) Guidelines for evaluating the discriminating efficiency of items (Ebel & Frisbie, 1991):
     Discrimination index    Item evaluation
     .40 and above           Very good items
     .30 - .39               Reasonably good items, but possibly subject to improvement
     .20 - .29               Marginal items, usually needing improvement
     .19 and below           Poor items, to be rejected or improved by revision
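The guidelines in the table can be applied to a computed D value with a small helper like the following sketch; the thresholds come from the Ebel & Frisbie table above, while the function itself is illustrative.

```python
# Minimal sketch applying the Ebel & Frisbie (1991) guidelines to a
# discrimination index D computed as on the previous slide.

def evaluate_discrimination(d):
    if d >= 0.40:
        return "Very good item"
    if d >= 0.30:
        return "Reasonably good item, but possibly subject to improvement"
    if d >= 0.20:
        return "Marginal item, usually needing improvement"
    return "Poor item, to be rejected or improved by revision"

print(evaluate_discrimination(0.45))  # Very good item
print(evaluate_discrimination(0.12))  # Poor item, to be rejected or improved by revision
```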

  9. WHAT ITEM DISCRIMINATION TELLS US I evaluated how well each item on my assessment differentiated between high- and low-performing students by finding the difference in item difficulty between the high- and low-performing groups. A good item should have an item discrimination value of .40 or above, which means that the item is useful in differentiating between students who are doing well overall and students who are having trouble.

  10. RELIABLE ITEM SCORING
     • Selection of a set of responses: random samples of student work, or work deliberately selected from high-, medium-, and low-achieving groups of students
     • Score by multiple people
     • Check for exact matches, differences of one score, and differences of two or more; flag disagreements
     • Discuss the results
     • Evaluation of the effectiveness of scoring rubrics
     • Training for additional scoring
     • Evaluation of inter-rater reliability
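One way to operationalize the "check for exact matches, difference of one score, difference of two or more" step is sketched below; the two-rater layout and the helper name are assumptions, not part of the slides.

```python
# Minimal sketch: compare two raters' rubric scores for the same responses,
# separating exact matches, one-point differences, and disagreements of two
# or more points (the last group is flagged for discussion).

def compare_raters(rater_a, rater_b):
    exact, off_by_one, flagged = [], [], []
    for i, (a, b) in enumerate(zip(rater_a, rater_b)):
        diff = abs(a - b)
        if diff == 0:
            exact.append(i)
        elif diff == 1:
            off_by_one.append(i)
        else:
            flagged.append(i)      # disagreements of 2+ points to discuss
    return exact, off_by_one, flagged

rater_a = [4, 3, 2, 4, 1]
rater_b = [4, 2, 4, 4, 1]
exact, close, flagged = compare_raters(rater_a, rater_b)
print(len(exact), "exact matches;", len(close), "one-point differences; flagged:", flagged)
```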

  11. ASSESSMENT QUALITY According to the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education Standards, assessment quality includes reliability, validity, and fairness.

  12. RELIABILITY What is it? "Could the results be replicated if the same individuals were tested again under similar circumstances?" Reliability is the "consistency" or "repeatability" of students' assessment results or scores. Factors influencing consistency:
     • Content, or the particular sample of tasks included on the assessment (e.g. item selection, different test forms)
     • Occasion on which the assessment is taken (e.g. test time, student's condition, classroom environment)

  13. VALIDITY Definition: Does the test measure what it was designed to measure? Five sources of validity evidence (AERA, APA, NCME, 2014):
     • Based on instrument content (content validity)
     • Based on response processes
     • Based on internal structure (construct validity)
     • Based on relationships with external variables
     • Based on consequences

  14. EVIDENCE OF VALIDITY #1 Evidence based on instrument content
     • Do the items appear to measure the intended content? Are there items that measure all aspects of the content?
     • Can come from expert judgments of items
     • Often based on a test blueprint: Are all items associated with a part of the blueprint? Are all intended sections represented by one or more items?
     • Grain size may be larger or smaller (e.g. problem solving vs. finding patterns or explaining the correspondence between tables and graphs)

  15. LEARNING PROGRESSION-BASED TEST BLUEPRINT A blueprint matrix with Learning Progression (LP) levels 1-5 as columns and cognitive process levels (L1 Remember, L2, L3) as rows.
     • One chooses the LP levels that reflect the instructional plan for the unit.
     • One chooses the appropriate levels of cognitive rigor for the age and ability of the students.
     • Each LP level is represented by the appropriate number of items.
     • Each chosen level of cognitive rigor is represented by the appropriate number of items.
     Note: Not all LP levels will include all levels of cognitive rigor.
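A blueprint check of this kind could be sketched as a simple tally, as below; the item tags, LP labels, and set of intended cells are hypothetical examples, not taken from the slides.

```python
# Minimal sketch: tally items by (LP level, cognitive process level) and check
# that every intended blueprint cell is covered by at least one item.

from collections import Counter

# Hypothetical item metadata: each item tagged with its LP level and cognitive level.
items = [("LP2", "L1 Remember"), ("LP2", "L2"), ("LP3", "L1 Remember"),
         ("LP3", "L2"), ("LP3", "L3"), ("LP4", "L2")]

# Hypothetical cells the blueprint intends to cover for this unit.
intended_cells = {("LP2", "L1 Remember"), ("LP2", "L2"), ("LP3", "L2"),
                  ("LP3", "L3"), ("LP4", "L2"), ("LP4", "L3")}

counts = Counter(items)
missing = intended_cells - set(counts)
print("Items per blueprint cell:", dict(counts))
print("Intended cells with no items:", missing)   # here: {('LP4', 'L3')}
```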

  16. WHAT CONTENT VALIDITY TELLS US I evaluated the content validity of my assessment using an assessment blueprint to make sure that all intended standards are measured by my items at the intended level of complexity, which provides me confidence that the items are an accurate representation of the concepts in the learning progression.

  17. EVIDENCE OF VALIDITY #2 Evidence based on response processes: Does the test format influence the results?
     • E.g. is a mathematical problem description so complex that reading level is also a factor in responses?
     • E.g. effects of test format: visual clutter or a difficult-to-read font; make sure to give enough space for students to write complete answers.
     • Often based on interviews of selected students: cognitive interviews or think-alouds, in which students describe what they are thinking as they answer the items; exit interviews or questionnaires, in which students, when finished with the test, describe what they liked/didn't like, what they found easiest and hardest to understand, etc.

  18. TWO METHODS TO OBTAIN INFORMATION ABOUT RESPONSE PROCESSES
     • Think-alouds: observing students who talk through their responses
     • Exit interviews: asking students to reprise their performance after taking the instrument, and asking them about their experiences

  19. THINK-ALOUD RESULTS

  20. EXIT INTERVIEWS Example questions:
     • About how long did it take you to answer each question? Question _____: ___________ minutes; Question _____: ___________ minutes
     • What parts of the test are confusing? What makes the test hard to understand or answer?
     • Did you go back and change any of your answers at any point during the test? If so, why?
     • Why didn't you write shorter explanations? Why didn't you write longer explanations?
     • If you were writing this test, how would you change it to make it a better test?

  21. WHAT RESPONSE PROCESSES TELL US I evaluated the validity of my assessment using think-aloud interviews with several students, which provides me confidence that the students' responses to the items reflect what they know and can do and were not influenced by the format, test conditions, or wording of the items, nor by misunderstanding of the questions.

  22. EVIDENCE OF VALIDITY #3 Evidence based on internal structure
     • Do the items show the expected difficulty order (i.e. are the items intended to be easy actually easy, and those intended to be difficult actually difficult)?
     • Is each item consistent with the overall purpose of the assessment?
     • Should all the items be added into a single score, or should they be kept as separate subscores (e.g. arithmetic accuracy and problem solving)?

  23. LOOKING AT INTERNAL STRUCTURE Are item difficulties as expected?
     Difficulty         Easy       Medium        Hard
     What I expected    1, 3, 5    2, 4, 6, 8    7, 9, 10
     What happened      1, 2, 3    4, 5, 6, 8    7, 9, 10
     The actual difficulty order is similar to expectations.
     Reminder: Difficulty = (sum of student scores on the item for all students) / (total number of students responding to the item)
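The comparison on this slide can be reproduced with a short sketch; the p-value cut-offs used to form the Easy/Medium/Hard bands are assumptions, and the item difficulties are hypothetical values chosen to match the slide's outcome.

```python
# Minimal sketch: band items as Easy / Medium / Hard by computed difficulty
# and compare with the bands that were expected.

expected = {"Easy": {1, 3, 5}, "Medium": {2, 4, 6, 8}, "Hard": {7, 9, 10}}

# Hypothetical computed difficulties (p-values) for items 1-10.
difficulty = {1: .85, 2: .80, 3: .78, 4: .60, 5: .55,
              6: .52, 7: .35, 8: .48, 9: .30, 10: .25}

def band(p):
    # Assumed cut-offs: >= .70 easy, >= .40 medium, otherwise hard.
    return "Easy" if p >= .70 else "Medium" if p >= .40 else "Hard"

actual = {"Easy": set(), "Medium": set(), "Hard": set()}
for item, p in difficulty.items():
    actual[band(p)].add(item)

for label in ("Easy", "Medium", "Hard"):
    print(label, "expected:", sorted(expected[label]), "actual:", sorted(actual[label]))
```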

  24. WHAT INTERNAL STRUCTURE TELLS US I evaluated the validity of my assessment by comparing expected to actual item difficulty, which provides me confidence that my items were performing as I expected. Item difficulty was unlikely to have been influenced much by unintended factors (e.g. complex non-mathematical vocabulary for a math item).

  25. The next three slides relate to validity evidence that is most important in large-scale testing and less easily measured in a single school classroom. We will not be discussing means of collecting these types of validity evidence.

  26. EVIDENCE OF VALIDITY #4 Evidence based on relations to external variables
     • Does the test relate to external criteria it is expected to predict (e.g. other tests of similar content)?
     • Is it less strongly related to criteria it would not be expected to predict? E.g. does a test of vocabulary relate more strongly to a reading test than to an arithmetic test? Does an emotional intelligence test relate more strongly to a personality assessment than to a test of academic skills?

  27. EVIDENCE OF VALIDITY #5 Evidence as to whether this assessment predicts as it should, e.g.:
     • Does a college entrance exam actually predict success in college?
     • Does an occupational skills test actually measure skills that will be needed on the job in question?

  28. EVIDENCE OF VALIDITY #6 Evidence based on consequences: Are the consequences of using the assessment results as expected?
     • E.g. the content represented on a test can change what is taught in classrooms in undesirable ways; for example, not including geometry-based items in elementary school assessments might result in teachers choosing not to teach these concepts.
     • Instructional consequences should be positive if the assessment method is valid and appropriate.

  29. BALANCING RELIABILITY AND VALIDITY

  30. FAIRNESS Consistency and unbiasedness:
     • Students' outcomes must not be influenced by the particular rater who scored their work.
     • Items must not unintentionally favor or disadvantage students from specific groups.
     A fair test provides scores that:
     • Are interpreted and used appropriately for specific purposes
     • Do not have adverse consequences as a result of the way they are interpreted/used

  31. EVIDENCE FOR FAIRNESS
     • Reliable item scoring (slide #10) also gives us evidence for fairness.
     • Lack of item bias (also called differential item functioning) provides evidence for fairness.
     • Differential item functioning: Do two groups of students, with otherwise equal proficiency, perform differently on an item? E.g. do girls and boys perform differently on an essay written in response to a novel about basketball players? Do English language learners perform differently than native English speakers on math story problems?

  32. AN EXAMPLE OF ASSESSING ITEM FAIRNESS
     Item    % boys passing    % girls passing
     1       .50               .49
     2       .67               .69
     3       .30               .59
     4       .44               .45
     5       .39               .37
     6       .85               .89
     7       .95               .95
     8       .73               .76
     9       .62               .59
     10      .77               .76
     Many more girls than boys passed item 3; this item shows evidence of differential functioning by gender. Groups don't always have to perform the same (e.g. ELL students may perform 10% worse than native speakers on all vocabulary items); as long as the difference between the groups is consistent, there is no differential functioning.
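A rough way to screen the table above for items like this is sketched below; the 0.10 flagging threshold and the comparison against the test-wide average gap are assumptions for illustration, not a standard DIF procedure.

```python
# Minimal sketch using the pass rates in the table above: flag items whose
# group gap departs noticeably from the test-wide pattern (a consistent gap
# across all items would not be flagged).

pass_rates = {             # item: (% boys passing, % girls passing)
    1: (.50, .49), 2: (.67, .69), 3: (.30, .59), 4: (.44, .45), 5: (.39, .37),
    6: (.85, .89), 7: (.95, .95), 8: (.73, .76), 9: (.62, .59), 10: (.77, .76),
}

gaps = {item: girls - boys for item, (boys, girls) in pass_rates.items()}
typical_gap = sum(gaps.values()) / len(gaps)

# Assumed flagging rule: deviation of more than 0.10 from the typical gap.
flagged = [item for item, gap in gaps.items() if abs(gap - typical_gap) > 0.10]
print("Typical girls-minus-boys gap:", round(typical_gap, 3))
print("Items to examine for differential functioning:", flagged)   # [3]
```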

  33. WHAT FAIRNESS TELLS US I evaluated the functioning of my items for both boys and girls, and found that all of my items performed equivalently for both groups. This gives me confidence that my test is fair with respect to gender.

  34. BIBLIOGRAPHY
     Nitko, A. J., & Brookhart, S. (2007). Educational assessment of students. Upper Saddle River, NJ: Pearson Education.
     McMillan, J. H. (2007). Classroom assessment: Principles and practice for effective standards-based instruction (4th ed.). Boston: Pearson - Allyn & Bacon.
     Oregon Department of Education. (2014, June). Assessment guidance.
     Popham, W. J. (2014). Criterion-referenced measurement: A half-century wasted? Paper presented at the Annual Meeting of the National Council on Measurement in Education, Philadelphia, PA.
     Popham, W. J. (2014). Classroom assessment: What teachers need to know. San Francisco, CA: Pearson.
     Russell, M. K., & Airasian, P. W. (2012). Classroom assessment: Concepts and applications. New York, NY: McGraw-Hill.
     Stevens, D., & Levi, A. (2005). Introduction to rubrics: An assessment tool to save grading time, convey effective feedback, and promote student learning. Sterling, VA: Stylus Publishing.
     Wihardini, D. (2010). Assessment development II. Unpublished manuscript, Research and Development Department, Binus Business School, Jakarta, Indonesia.
     Wilson, M. (2005). Constructing measures: An item response modeling approach. New York: Psychology Press, Taylor & Francis Group.
     Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

  35. CREATIVE COMMONS LICENSE Assessment Quality PPT by the Oregon Department of Education and Berkeley Evaluation and Assessment Research Center is licensed under a CC BY 4.0.
     You are free to:
     • Share: copy and redistribute the material in any medium or format
     • Adapt: remix, transform, and build upon the material
     Under the following terms:
     • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
     • NonCommercial: You may not use the material for commercial purposes.
     • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
     The Oregon Department of Education welcomes editing of these resources and would greatly appreciate being able to learn from the changes made. To share an edited version of this resource, please contact Cristen McLean, cristen.mclean@state.or.us.
