Comparative Study of Generative AI vs. Human in Generating Multiple Choice Questions


This study compares multiple-choice questions generated by Generative AI with those written by humans, based on the PIRLS Reading Assessment Framework. The aim is to assess the quality and objectivity of questions produced by both methods in an educational context.

  • AI Research
  • Education Assessment
  • Comparative Study
  • Question Generation
  • PIRLS Framework


Presentation Transcript


  1. UCL Festival of Digital Research and Scholarship. Comparative Study of Generative AI (ChatGPT) vs. Human in Generating Multiple Choice Questions Based on the PIRLS Reading Assessment Framework. Dr Elsie Ong and Professor Samuel Chu. Elsie.ong@ucl.ac.uk. 11 June 2024, 12:45-1pm.

  2. A bit about how the idea came about. Knowledge Overlord (KO) is a self-sustaining, AI game-based online platform to enhance students' literacy and 21st-century skills. It is a 3-year project funded by the Quality Education Fund (HKD$10 million). The platform uses the mechanics of gamified learning to encourage students (users) to read and learn; promotes literacy in both reading comprehension and writing based on the PIRLS framework; supplements schools' methods of monitoring and assessing reading assignments and customizes difficulty based on users' performance to provide a challenging experience; and creates a user-constructed curriculum with a self-sustaining community in which users together create, evaluate, comment on and perfect questions for books they have read. It fosters the 4Cs (Critical thinking, Communication, Creativity, Collaboration), digital literacies (Information, Media, and Technology literacies), two of the 3Rs (Reading, Writing), and lifelong learning and career skills.

  3. Human-generated multiple-choice questions. MCQs are commonly used to ensure objective evaluation in education. The problem: generating high-quality questions is difficult and time-consuming. A possible solution: generative artificial intelligence (GenAI) has emerged as an automated approach to question generation, but challenges remain in terms of biases and limited diversity in training data.

  4. Aim: to compare the quality of GenAI-generated MCQs with human-created ones.

  5. The PIRLS framework. PIRLS (Progress in International Reading Literacy Study) is a widely recognized assessment framework used to evaluate reading comprehension skills among students worldwide. The framework covers four key aspects of reading literacy: information retrieval, straightforward inference, interpretation, and evaluation.

  6. Related studies. Laupichler et al. (2024) compared AI-generated and human-written questions in a medical preparatory exam. They found no statistically significant difference in question difficulty, and the LLM-generated questions were better at distinguishing high-performing from low-performing students. The findings suggest that LLMs, including ChatGPT, can be employed successfully to create questions for formative exams in medical schools.

  7. METHOD. The research comprised two parts: Part 1, question creation by humans versus by GenAI; and Part 2, question assessment. The goal was to evaluate the capabilities of humans versus GenAI, specifically ChatGPT-4, in generating MCQs aligned with the PIRLS reading assessment framework. Data sources: two children's books were drawn from the eFunReading.com and Reading Battle 2.0 online platforms, chosen for their appeal to the target age group of 8-12 and their diversity of genre.

  8. METHOD, Part 1: Question Creation. Group 1 drew on human expertise to select questions available from the data source, while Group 2 used ChatGPT to generate MCQs aligned with the PIRLS assessment framework. Group 1 (Human Expert): one human expert was selected for their extensive experience in question creation and understanding of the framework. 64 MCQs were selected from the data source, including eight questions from each of the two books' MCQ datasets, ensuring a balanced representation across the PIRLS assessment levels. Group 2 (GenAI): Few-Shot Learning (FSL) was applied, with ChatGPT given explicit MCQ examples for each PIRLS level alongside the book excerpts; a sketch of such a prompt appears below. FSL refers to the ability of GenAI models to generalize from a few labeled training examples (Parnami & Lee, 2022). This approach aimed to further refine the AI's question-generation process.
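As an illustration of the few-shot setup described above, here is a minimal sketch of how a PIRLS-aligned MCQ prompt could be assembled with the OpenAI Python client. The example MCQs, book excerpt, system instructions and model name are placeholder assumptions, not the prompts actually used in the study.

```python
# A minimal sketch (not the study's actual prompts) of few-shot MCQ generation
# with the OpenAI Python client. Examples, excerpt and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One worked example per PIRLS level (hypothetical, abbreviated)
few_shot_examples = """
Level 1 (Information Retrieval): Q: What did Jack trade for the cow? ...
Level 2 (Inference): Q: Why was Jack's mother angry? ...
Level 3 (Interpretation): Q: What does the story suggest about taking risks? ...
Level 4 (Evaluation): Q: Is the ending convincing? Why or why not? ...
"""

book_excerpt = "Once upon a time there was a boy called Jack..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4",  # the study used ChatGPT-4
    messages=[
        {"role": "system",
         "content": "You write multiple-choice questions for readers aged 8-12, "
                    "aligned with the four PIRLS comprehension processes."},
        {"role": "user",
         "content": f"Here are example MCQs for each PIRLS level:\n{few_shot_examples}\n"
                    f"Passage:\n{book_excerpt}\n"
                    "Write one MCQ (four options, one correct) for each PIRLS level."},
    ],
)
print(response.choices[0].message.content)
```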

  9. Assessment criteria. The assessment tool, categorized into four domains, was adapted from the measures used in Cheung et al. (2023).

  10. The assessment tool, categorized into four domains, was adapted from the measures used in Cheung et al. (2023):
  1. Alignment with the specified PIRLS level: determine whether the question is set at the stated level (1-Information Retrieval, 2-Inferences, 3-Interpretation, 4-Evaluation);
  2. Clarity and Specificity: determine whether the question is clear, specific and answerable, without ambiguity and without being under- or over-informative;
  3. Appropriateness: determine whether the question is correct, appropriately constructed, of appropriate length and well formed;
  4. Suitability for Specific Age Group: determine whether the question is suitable for assessing reading comprehension skills in the target age group.
  Each question was rated on the reviewer's agreement with each statement on a numeric scale from 0 to 10, with 0 meaning "extremely disagree" and 10 "extremely agree". The assessors were also asked to judge whether each question was constructed by AI or by a human, and they were blinded to how many questions each source had created. A sketch of how such ratings might be aggregated follows.
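To make the scoring procedure concrete, the sketch below shows one way the 0-10 domain ratings and the blinded AI-vs-human guesses could be recorded and aggregated into per-source means and identification accuracy. The field names and sample records are hypothetical; this is not the study's analysis code.

```python
# A minimal sketch (illustrative only) of aggregating 0-10 ratings and blinded
# AI-vs-human guesses. Field names and example records are hypothetical.
from statistics import mean

DOMAINS = ["alignment", "clarity", "appropriateness", "suitability"]

# Each record: one assessor's ratings and source guess for one question.
ratings = [
    {"source": "ai",    "guess": "human", "alignment": 8, "clarity": 8, "appropriateness": 8, "suitability": 8},
    {"source": "human", "guess": "human", "alignment": 7, "clarity": 7, "appropriateness": 6, "suitability": 7},
    # ... one entry per question per assessor
]

for source in ("ai", "human"):
    rows = [r for r in ratings if r["source"] == source]
    domain_means = {d: mean(r[d] for r in rows) for d in DOMAINS}
    accuracy = mean(r["guess"] == source for r in rows)  # identification accuracy
    print(source, domain_means, format(accuracy, ".2%"))
```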

  11. Please read through each reading source, the Multiple Choice Question (MCQ) and the answer options below. Passage 1: Jack and the Beanstalk. "Once upon a time there was a boy called Jack. Jack lived in a cottage with his mother. They were very poor and their most valuable possession was a cow. One day, Jack's mother asked him to take the cow to market to sell. On the way, Jack met a man who gave him some magic beans in exchange for the cow. When Jack came home with the beans his mother was angry. She threw the beans out of the window and sent him to bed. The next morning, Jack looked out of the window. A giant beanstalk had grown in the garden! Jack decided to climb the beanstalk. It was so tall it went right up to the sky and through the clouds. When Jack finally reached the top, he saw an enormous castle. Jack decided to go inside; all the furniture was huge! Suddenly, Jack heard a loud noise. He ran into a cupboard to hide. An enormous giant came into the room. 'Fee, Fi, Fo, Fum, I smell the blood of an Englishman!' he bellowed. The giant sat down at the table. On the table was a hen and a golden harp. 'Lay!' said the giant. The hen laid an egg; it was made out of solid gold. 'Sing!' said the giant. The harp sang; the giant fell asleep. Jack jumped out of the cupboard and took the hen and the harp. As Jack ran, the harp cried, 'Help, master!' The giant woke up and called, 'Fee, Fi, Fo, Fum, I smell the blood of an Englishman!' He chased Jack to the top of the beanstalk. Jack climbed down the beanstalk and the giant followed him. As Jack got to the bottom of the beanstalk he shouted, 'Help!' Jack's mother came out with an axe. She used it to chop the bottom of the beanstalk. The giant fell and crashed to the ground. He was never seen again. With the golden eggs and the magic harp, Jack and his mother lived happily ever after." Sample question: What did Jack bring from the castle? A. Gold  B. Silver  C. Coins  D. Copper. Answer: (A)

  12. Results

  Table 1: Identification Accuracy of AI-Generated vs. Human-Generated MCQs
  Source            Correctly identified   Incorrectly identified   Total   % correctly identified
  AI-generated      21                     43                       64      32.81%
  Human-generated   36                     28                       64      56.25%

  As the data did not meet the assumptions required for parametric tests, partly due to the small sample size, three Wilcoxon signed-rank tests were conducted; they indicated no significant differences between human- and GenAI-generated MCQs in the ratings of (1) Clarity and Specificity, (2) Appropriateness, and (3) Suitability for Specific Age Group (p > .05). However, MCQs generated by GenAI were rated significantly higher on alignment with the PIRLS assessment framework (mean 7.98) than those generated by humans (6.61), p = .035. A sketch of this kind of analysis appears below.
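For readers who want to reproduce this kind of comparison, the sketch below runs a Wilcoxon signed-rank test on paired per-question ratings using SciPy. The score arrays are hypothetical placeholders, not the study's data.

```python
# A minimal sketch of a Wilcoxon signed-rank test on paired per-question ratings,
# assuming SciPy; the score arrays below are hypothetical placeholders.
from scipy.stats import wilcoxon

# Paired ratings for one domain (e.g. PIRLS alignment), one value per question pair
ai_scores    = [8, 9, 7, 8, 8, 9, 7, 8]
human_scores = [7, 6, 7, 6, 7, 7, 6, 7]

stat, p_value = wilcoxon(ai_scores, human_scores)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")
```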

  13. Results

  Assessment Domain                           AI Mean Score   Human Mean Score   p
  Alignment with PIRLS Level (1-10)           7.98            6.61               0.03485
  Clarity and Specificity (1-10)              7.86            6.86               0.04238
  Appropriateness (1-10)                      7.92            6.56               0.06941
  Suitability for Specific Age Group (1-10)   7.81            6.56               0.07876

  14. DISCUSSION. MCQs generated by GenAI were comparable to human-created ones on Clarity and Specificity, Appropriateness, and Suitability for Specific Age Group. Based on the ratings of the four assessors, both humans and GenAI created clear and accurate questions efficiently, and both were effective in generating a diverse range of questions covering the various levels of the PIRLS framework. The low percentage of correctly identified AI-generated MCQs (32.81%; see Table 1) suggests that assessors found it difficult to distinguish GenAI questions from human-created ones. One of the primary challenges in categorizing questions within the PIRLS framework is ambiguity in question phrasing: questions may contain multiple components or require question creators to make inferences from implicit information, making it difficult to assign them to a specific PIRLS level. This ambiguity can lead to inconsistencies in categorization and affect the overall reliability of the assessment.

  15. IMPACT. By demonstrating that GenAI can effectively create MCQs based on topic, difficulty level and question format, the study highlights the potential for our project to use AI tools to customize assessments to better fit the unique needs of students and their curriculum. This flexibility allows us to quickly produce tailored assessment materials that align with specific learning objectives and student capabilities, thereby enhancing the relevance and effectiveness of the evaluation process. Moreover, the efficiency and scalability of AI-generated questions can reduce the workload for educators, freeing up more time for personalized teaching and student engagement.

  16. CONCLUSION. As educational technologies continue to evolve, integrating GenAI into the question-creation process promises to revolutionize how educators approach testing, potentially leading to more dynamic, responsive and individualized learning environments. However, questions often contain elements of multiple complexity levels, making it challenging to assign them to a specific PIRLS level. Given the subjectivity of the MCQ creation and categorization exercises, providing comprehensive training for assessors, and for GenAI, on the categorization criteria and guidelines within the PIRLS framework could help improve consistency in scoring practices.

  17. References
  Doughty, J., Wan, Z., Bompelli, A., Qayum, J., Wang, T., Zhang, J., Zheng, Y., Doyle, A., Sridhar, P., Agarwal, A., Bogart, C., Keylor, E., Kultur, C., Savelka, J., & Sakr, M. (2024). A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education. In Proceedings of the 26th Australasian Computing Education Conference (ACE 2024). ACM. https://doi.org/10.1145/3636243.3636256
  Ibrahim, A., Alhosani, N., & Vaughan, T. (2020). Impact of language and curriculum on student international exam performances in the United Arab Emirates. Cogent Education, 7(1). https://doi.org/10.1080/2331186X.2020.1808284
  IEA. (2021). PIRLS 2021 assessment frameworks. International Association for the Evaluation of Educational Achievement.
  Howie, S., Combrinck, C., Roux, K., Tshele, M., Mokoena, G., & Palane, N. M. (2017). Progress in International Reading Literacy Study 2016: South African children's reading literacy achievement. Centre for Evaluation and Assessment.
  Laupichler, M. C., Rother, J. F., Kadow, I. C. G., Ahmadi, S., & Raupach, T. (2024). Large language models in medical education: Comparing ChatGPT- to human-generated exam questions. Academic Medicine, 99(5), 508-512.
  Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (Eds.). (2017). PIRLS 2016 assessment framework. International Association for the Evaluation of Educational Achievement (IEA).
  Su, J., & Yang, W. (2023). Unlocking the power of ChatGPT: A framework for applying generative AI in education. ECNU Review of Education, 6(3), 355-366. https://doi.org/10.1177/20965311231168423
  Parnami, A., & Lee, M. (2022). Learning from few examples: A summary of approaches to few-shot learning. arXiv preprint.
