Learning-Aware Reliability Estimation for Tutor Skill Assessment Using Large Language Models
DOI:
https://doi.org/10.18608/jla.2026.9133
Keywords:
reliability, large language models, assessment, short answer grading, tutor training, online learning
Abstract
Assessment is foundational to learning analytics, especially in evaluating instructional interventions and guiding improvement in online learning environments. With the growing use of large language models (LLMs) to score open-ended responses, questions arise about the reliability of these model-generated scores, particularly in short pre-post formats where learners are expected to improve. This study introduces a novel method for estimating test reliability that adjusts for learning gains using a Rasch-based split-half approach. We validated this approach through simulation under realistic conditions of missing data and score change, showing tangible improvements in reliability estimation compared to baseline methods. Applying this method to a dataset of 985 tutors completing 12 online lessons, we find that GPT-4-based scoring achieves satisfactory reliability, with open-ended responses (0.733) outperforming multiple-choice items (0.652). Both item types jointly yielded the highest reliability (0.774). Hence, as few as 14 open-ended items (across an average of 3-4 completed lessons) were sufficient to surpass common reliability thresholds of 0.7 or higher. Principal component analysis revealed a skill structure with a strong primary dimension shared across almost all lessons and interpretable subdimensions—socio-emotional, cognitive, and fairness-related tutoring skills—supporting a bifactor-like model. These findings demonstrate that GPT-4 and similar LLMs can be effectively used for formative assessment of complex instructional skills in online and personalized learning contexts, provided their reliability is empirically verified. This study contributes an open-source, learning-aware framework for scalable and reliable AI-supported assessment in learning analytics contexts.
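The split-half reasoning behind the item-count claim can be illustrated with the classical Spearman-Brown formula. The sketch below is not the learning-aware, Rasch-based estimator the study introduces; it shows only the textbook building blocks, assuming complete data and no score change between items. All function names and the toy response matrix are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_brown(r, k=2.0):
    """Predicted reliability when a test with reliability r is lengthened k-fold."""
    return k * r / (1 + (k - 1) * r)


def split_half_reliability(responses):
    """Odd/even split-half reliability with the Spearman-Brown correction.

    responses: one list of item scores per learner, with no missing data
    (an assumption; the study's setting includes missing data and learning gains).
    """
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    return spearman_brown(pearson(odd, even))


def length_factor(r, target):
    """Factor by which a test with reliability r must grow to reach the target."""
    return target * (1 - r) / (r * (1 - target))


# Hypothetical toy data: 4 learners x 4 dichotomous items.
toy = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
toy_reliability = split_half_reliability(toy)
```

Under these assumptions, `length_factor` inverts the Spearman-Brown formula, which is the kind of calculation that turns an observed reliability into a minimum number of items for a chosen threshold such as 0.7.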
License
Copyright (c) 2024 Journal of Learning Analytics

This work is licensed under a Creative Commons Attribution 4.0 International License.