Learning to Love LLMs for Answer Interpretation

Chain-of-Thought Prompting and the AMMORE Dataset

DOI:

https://doi.org/10.18608/jla.2025.8621

Keywords:

large language models (LLMs), formative assessment, math education, research paper

Abstract

This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a mathematics learning platform used by middle and high school students in several African countries. Using this dataset, we conducted two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. In experiment 1, we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach, chain-of-thought prompting, accurately scored 97% of these edge cases, effectively boosting the overall accuracy of the grading from 96% to 97%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimates student mastery of specific lessons. We find that modest improvements in model accuracy can lead to significant changes in mastery estimation. Where the rule-based classifier misclassified the mastery status of 6.9% of students across completed lessons, using the LLM chain-of-thought approach reduced this to 2.6%. These findings suggest that LLMs could be valuable for grading fill-in questions in mathematics education, potentially enabling wider adoption of open-response questions in learning systems.
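
To make the pipeline described above concrete, the sketch below illustrates the two components in simplified form: a chain-of-thought grading prompt of the kind evaluated in experiment 1, and a standard Bayesian Knowledge Tracing update (Corbett & Anderson, 1994) of the kind used in experiment 2. This is an illustrative sketch, not the authors' implementation: the call_llm helper, the prompt wording, and the slip, guess, and transit parameters are assumptions.

# Illustrative sketch only. `call_llm` is a hypothetical helper standing in for
# any chat-completion API; the prompt wording and BKT parameters are assumptions,
# not values taken from the paper.

def grade_with_chain_of_thought(question: str, expected: str,
                                student_answer: str, call_llm) -> bool:
    """Ask the LLM to reason step by step, then emit a final CORRECT/INCORRECT label."""
    prompt = (
        "You are grading a short open-response math answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Student answer: {student_answer}\n"
        "Think step by step about whether the student answer is mathematically "
        "equivalent to the expected answer, then end with exactly one line: "
        "'LABEL: CORRECT' or 'LABEL: INCORRECT'."
    )
    last_line = call_llm(prompt).strip().splitlines()[-1].upper()
    # Check INCORRECT first: the string "CORRECT" is a substring of "INCORRECT".
    return "INCORRECT" not in last_line and "CORRECT" in last_line

def bkt_update(p_know: float, correct: bool,
               p_slip: float = 0.1, p_guess: float = 0.2,
               p_transit: float = 0.15) -> float:
    """One standard BKT step: condition mastery on the graded response, then apply learning."""
    if correct:
        evidence = p_know * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_know) * p_guess)
    else:
        evidence = p_know * p_slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - p_guess))
    return posterior + (1 - posterior) * p_transit

In this framing, experiment 2 amounts to folding each graded response through bkt_update and comparing the resulting mastery estimates (for example, whether p_know crosses a common threshold such as 0.95) under rule-based versus LLM-generated labels; the threshold and parameter values shown here are illustrative, not those reported in the paper.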

References

Abdelrahman, G., Wang, Q., & Nunes, B. (2023). Knowledge tracing: A survey. ACM Computing Surveys, 55(11), 224. https://doi.org/10.1145/3569576

Allen, L. K., Snow, E. L., Crossley, S. A., Tanner Jackson, G., & McNamara, D. S. (2014). Reading comprehension components and their relation to writing. L’Année Psychologique, 114(4), 663–691. https://doi.org/10.4074/S0003503314004047

Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27(1), 3–23. https://doi.org/10.2307/3315487

Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 92(1), 81–90. https://doi.org/10.1177/003172171009200119

Botelho, A., Baral, S., Erickson, J. A., Benachamardi, P., & Heffernan, N. T. (2023). Leveraging natural language processing to support automated assessment and feedback for student open responses in mathematics. Journal of Computer Assisted Learning, 39(3), 823–840. https://doi.org/10.1111/jcal.12793

Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60–117. https://doi.org/10.1007/s40593-014-0026-8

Cechinel, C., Ochoa, X., Lemos Dos Santos, H., Carvalho Nunes, J. B., Rodés, V., & Marques Queiroga, E. (2020). Mapping learning analytics initiatives in Latin America. British Journal of Educational Technology, 51(4), 892–914. https://doi.org/10.1111/bjet.12941

Chrysafiadi, K., & Virvou, M. (2013). Student modeling approaches: A literature review for the last decade. Expert Systems with Applications, 40(11), 4715–4729. https://doi.org/10.1016/j.eswa.2013.02.007

Cochran, K., Cohn, C., Hutchins, N., Biswas, G., & Hastings, P. (2022). Improving automated evaluation of formative assessments with text data augmentation. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial intelligence in education: 23rd international conference, AIED 2022, Durham, UK, July 27–31, 2022, proceedings, part I (pp. 390–401). Springer International Publishing. https://doi.org/10.1007/978-3-031-11644-5_32

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278. https://doi.org/10.1007/BF01099821

Crossley, S. A., Kim, M., Allen, L., & McNamara, D. (2019). Automated summarization evaluation (ASE) using natural language processing tools. In S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, & R. Luckin (Eds.), Artificial intelligence in education: 20th international conference, AIED 2019, Chicago, IL, USA, June 25–29, 2019, proceedings, part I (pp. 84–95). Springer International Publishing. https://doi.org/10.1007/978-3-030-23204-7_8

Cukurova, M., Khan-Galaria, M., Millán, E., & Luckin, R. (2022). A learning analytics approach to monitoring the quality of online one-to-one tutoring. Journal of Learning Analytics, 9(2), 105–120. https://doi.org/10.18608/jla.2022.7411

Dey, I., Gnesdilow, D., Passonneau, R., & Puntambekar, S. (2024). Potential pitfalls of false positives. In A. M. Olney, I.-A. Chounta, Z. Liu, O. C. Santos, & I. I. Bittencourt (Eds.), Artificial intelligence in education: Posters and late-breaking results, workshops and tutorials, industry and innovation tracks, practitioners, doctoral consortium and blue sky: 25th international conference, AIED 2024, Recife, Brazil, July 8–12, 2024, proceedings, part I (pp. 469–476). Springer Cham. https://doi.org/10.1007/978-3-031-64315-6_45

Feng, M., Heffernan, N., & Koedinger, K. (2009). Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19(3), 243–266. https://doi.org/10.1007/s11257-009-9063-7

Funk, S. C., & Dickson, K. L. (2011). Multiple-choice and short-answer exam performance in a college classroom. Teaching of Psychology, 38(4), 273–277. https://doi.org/10.1177/0098628311421329

Gikandi, J. W., Morrow, D., & Davis, N. E. (2011). Online formative assessment in higher education: A review of the literature. Computers & Education, 57(4), 2333–2351. https://doi.org/10.1016/j.compedu.2011.06.004

Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120. https://doi.org/10.1073/pnas.2305016120

Gurung, A., Vanacore, K., McReynolds, A. A., Ostrow, K. S., Worden, E., Sales, A. C., & Heffernan, N. T. (2024). Multiple choice vs. fill-in problems: The trade-off between scalability and learning. Proceedings of the 14th Learning Analytics and Knowledge Conference (LAK ’24), 18–22 March 2024, Kyoto, Japan (pp. 507–517). https://doi.org/10.1145/3636555.3636908

Hahn, M. G., Navarro, S. M. B., De La Fuente Valentín, L., & Burgos, D. (2021). A systematic review of the effects of automatic scoring and automatic feedback in educational settings. IEEE Access, 9, 108190–108198. https://doi.org/10.1109/ACCESS.2021.3100890

Henkel, O. (2024, March 21). Rori - Quick intro [Video]. YouTube. https://www.youtube.com/watch?v=xXg6XRajbbk

Henkel, O., Hills, L., Roberts, B., & McGrane, J. (2024). Can LLMs grade open response reading comprehension questions? An empirical study using the ROARs dataset. International Journal of Artificial Intelligence in Education. https://doi.org/10.1007/s40593-024-00431-z

Hsu, S., Li, T. W., Zhang, Z., Fowler, M., Zilles, C., & Karahalios, K. (2021). Attitudes surrounding an imperfect AI autograder. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21), 8–13 May 2021, Yokohama, Japan (Article 681). https://doi.org/10.1145/3411764.3445424

Injeti, A. S., Rupsica, G. N., Reddy, G. P., Balakrishnan, R. M., & Pati, P. B. (2024). A machine learning-based classification of students’ algebraic responses using MathBERT embeddings. Proceedings of the 2024 5th International Conference for Emerging Technology (INCET), 24–26 May 2024, Belgaum, India (pp. 1–6). https://doi.org/10.1109/INCET61516.2024.10593432

Johnson, M., & Green, S. (2006). On-line mathematics assessment: The impact of mode on performance and question answering strategies. The Journal of Technology, Learning and Assessment, 4(5). https://ejournals.bc.edu/index.php/jtla/article/view/1652

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2023). Large language models are zero-shot reasoners. arXiv. https://doi.org/10.48550/arXiv.2205.11916

Kortemeyer, G. (2023). Performance of the pre-trained large language model GPT-4 on automated short answer grading. arXiv. https://doi.org/10.48550/arXiv.2309.09338

Lan, A. S., Vats, D., Waters, A. E., & Baraniuk, R. G. (2015). Mathematical language processing: Automatic grading and feedback for open response mathematical questions. Proceedings of the Second (2015) ACM Conference on Learning @ Scale (L@S ’15), 14–18 March 2015, Vancouver, BC, Canada (pp. 167–176). https://doi.org/10.1145/2724660.2724664

Magliano, J. P., & Graesser, A. C. (2012). Computer-based assessment of student-constructed responses. Behavior Research Methods, 44(3), 608–621. https://doi.org/10.3758/s13428-012-0211-3

Mayfield, E., & Black, A. W. (2020). Should you fine-tune BERT for automated essay scoring? Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, 10 July 2020, Seattle, WA, USA (pp. 151–162). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.15

Morjaria, L., Burns, L., Bracken, K., Levinson, A. J., Ngo, Q. N., Lee, M., & Sibbald, M. (2024). Examining the efficacy of ChatGPT in marking short-answer assessments in an undergraduate medical program. International Medical Education, 3(1), 32–43. https://doi.org/10.3390/ime3010004

Motz, B. A., Bergner, Y., Brooks, C. A., Gladden, A., Gray, G., Lang, C., Li, W., Marmolejo-Ramos, F., & Quick, J. D. (2023). A LAK of direction: Misalignment between the goals of learning analytics and its research scholarship. Journal of Learning Analytics, 10(2), 1–13. https://doi.org/10.18608/jla.2023.7913

Nguyen, H. A., Hou, X., Stamper, J., & McLaren, B. M. (2020). Moving beyond test scores: Analyzing the effectiveness of a digital learning game through learning analytics. Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020), 10–13 July 2020, Online (pp. 487–495). International Educational Data Mining Society.

O’Neil, H. F., Jr., & Brown, R. S. (1998). Differential effects of question formats in math assessment on metacognition and affect. Applied Measurement in Education, 11(4), 331–351. https://doi.org/10.1207/s15324818ame1104_3

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv. https://doi.org/10.48550/arXiv.2203.02155

Pelánek, R. (2015). Metrics for evaluation of student models. Journal of Educational Data Mining, 7(2), 1–19. https://doi.org/10.5281/zenodo.3554665

Pulman, S. G., & Sukkarieh, J. Z. (2005). Automatic short answer marking. Proceedings of the Second Workshop on Building Educational Applications Using NLP (EdAppsNLP 05), 29 June 2005, Ann Arbor, MI, USA (pp. 9–16). Association for Computational Linguistics. https://doi.org/10.3115/1609829.1609831

Rajendran, R., Iyer, S., & Murthy, S. (2019). Personalized affective feedback to address students’ frustration in ITS. IEEE Transactions on Learning Technologies, 12(1), 87–97. https://doi.org/10.1109/TLT.2018.2807447

Rising Academies. (2024). Rori [Software]. Available from https://rori.ai/

Rupp, A. A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with multiple-choice questions shapes the construct: A cognitive processing perspective. Language Testing, 23(4), 441–474. https://doi.org/10.1191/0265532206lt337oa

Schneider, J., Schenk, B., & Niklaus, C. (2024). Towards LLM-based autograding for short textual answers. Proceedings of the 16th International Conference on Computer Supported Education (CSEDU 2024), 2–4 May 2024, Angers, France (pp. 280–288). SciTePress. https://doi.org/10.5220/0012552200003693

Shen, J. T., Yamashita, M., Prihar, E., Heffernan, N., Wu, X., Graff, B., & Lee, D. (2023). MathBERT: A pre-trained language model for general NLP tasks in mathematics education. arXiv. https://doi.org/10.48550/arXiv.2106.07340

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2022). Learning to summarize from human feedback. arXiv. https://doi.org/10.48550/arXiv.2009.01325

Sung, C., Dhamecha, T., Saha, S., Ma, T., Reddy, V., & Arora, R. (2019). Pre-training BERT on domain resources for short answer grading. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3–7 November 2019, Hong Kong, China (pp. 6070–6074). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1628

USAID. (2019). Global proficiency framework: Reading and mathematics. United States Agency for International Development. https://www.edu-links.org/resources/global-proficiency-framework-reading-and-mathematics

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent abilities of large language models. arXiv. https://doi.org/10.48550/arXiv.2206.07682

Published

2025-03-23

How to Cite

Henkel, O., Horne-Robinson, H., Dyshel, M., Thompson, G., Abboud, R., Ch, N. A. N., Moreau-Pernet, B., & Vanacore, K. (2025). Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset. Journal of Learning Analytics, 12(1), 50–64. https://doi.org/10.18608/jla.2025.8621

Issue

Vol. 12 No. 1 (2025)

Section

Special Section: Generative AI and Learning Analytics