Learning to Love LLMs for Answer Interpretation
Chain-of-Thought Prompting and the AMMORE Dataset
DOI: https://doi.org/10.18608/jla.2025.8621

Keywords: large language models (LLMs), formative assessment, math education, research paper

Abstract
This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a mathematics learning platform used by middle and high school students in several African countries. Using this dataset, we conducted two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. In experiment 1, we used a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We found that the best-performing approach, chain-of-thought prompting, accurately scored 97% of these edge cases, effectively boosting the overall grading accuracy from 96% to 97%. In experiment 2, we aimed to better understand the consequential validity of the improved grading accuracy by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We found that modest improvements in model accuracy can lead to significant changes in mastery estimation: where the rule-based classifier misclassified the mastery status of 6.9% of students across completed lessons, the LLM chain-of-thought approach reduced this to 2.6%. These findings suggest that LLMs could be valuable for grading fill-in questions in mathematics education, potentially enabling wider adoption of open-response questions in learning systems.
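To make the mastery-estimation step in experiment 2 concrete, the following is a minimal sketch of a standard BKT update of the kind grades would be fed into. The parameter values (guess, slip, transit, prior, and the mastery threshold) are illustrative placeholders, not the values used in the paper.

```python
def bkt_update(p_mastery, correct, guess=0.2, slip=0.1, transit=0.15):
    """One BKT step: Bayesian posterior given the graded answer, then a learning transition."""
    if correct:
        likelihood = p_mastery * (1 - slip)
        evidence = likelihood + (1 - p_mastery) * guess
    else:
        likelihood = p_mastery * slip
        evidence = likelihood + (1 - p_mastery) * (1 - guess)
    posterior = likelihood / evidence
    # Probability the student learned the skill on this practice opportunity.
    return posterior + (1 - posterior) * transit


def mastery_reached(grades, p_init=0.3, threshold=0.95):
    """Run a sequence of binary grades through BKT; report whether mastery is estimated."""
    p = p_init
    for g in grades:
        p = bkt_update(p, g)
    return p >= threshold
```

Because the update is applied sequentially, a single misgraded answer shifts every subsequent mastery estimate, which is why small grading errors can flip a student's estimated mastery status.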
Copyright (c) 2024 Journal of Learning Analytics

This work is licensed under a Creative Commons Attribution 4.0 International License.