Learning-Aware Reliability Estimation for Tutor Skill Assessment Using Large Language Models

Conrad Borchers; Danielle R. Thomas; Jionghao Lin; Kenneth R. Koedinger

doi:10.18608/jla.2026.9133

Authors

Conrad Borchers Carnegie Mellon University https://orcid.org/0000-0003-3437-8979
Danielle R. Thomas Carnegie Mellon University
Jionghao Lin University of Hong Kong https://orcid.org/0000-0003-3320-3907
Kenneth R. Koedinger Carnegie Mellon University

Keywords:

reliability, large language models, assessment, short answer grading, tutor training, online learning

Abstract

Assessment is foundational to learning analytics, especially for evaluating instructional interventions and guiding improvement in online learning environments. With the growing use of large language models (LLMs) to score open-ended responses, questions arise about the reliability of these model-generated scores, particularly in short pre-post formats where learners are expected to improve. This study introduces a novel method for estimating test reliability that adjusts for learning gains using a Rasch-based split-half approach. We validated this approach through simulation under realistic conditions of missing data and score change, showing tangible improvements in reliability estimation compared to baseline methods. Applying this method to a dataset of 985 tutors completing 12 online lessons, we found that GPT-4-based scoring achieves satisfactory reliability, with open-ended responses (0.733) outperforming multiple-choice items (0.652). Combining both item types yielded the highest reliability (0.774). As few as 14 open-ended items (across an average of 3 to 4 completed lessons) were sufficient to surpass the common reliability threshold of 0.7. Principal component analysis revealed a skill structure with a strong primary dimension shared across almost all lessons and interpretable subdimensions (socio-emotional, cognitive, and fairness-related tutoring skills), supporting a bifactor-like model. These findings demonstrate that GPT-4 and similar LLMs can be effectively used for formative assessment of complex instructional skills in online and personalized learning contexts, provided their reliability is empirically verified. This study contributes an open-source, learning-aware framework for scalable and reliable AI-supported assessment in learning analytics contexts.
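For readers unfamiliar with split-half reliability, the following is a minimal Python sketch of the classical version of the idea: items are randomly divided into two halves, the half scores are correlated across learners, and the Spearman-Brown formula projects that correlation to the full test length. It is an illustration only and is not the learning-aware, Rasch-based estimator introduced in the article, which is not reproduced here. The function names, the use of `None` for missing responses, and the assumption of numeric item scores are all hypothetical choices made for this sketch.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r, k=2.0):
    """Predicted reliability when a test with reliability r is lengthened by factor k."""
    return k * r / (1.0 + (k - 1.0) * r)


def length_factor(r_current, r_target):
    """Factor by which test length must change to move from r_current to r_target."""
    return r_target * (1.0 - r_current) / (r_current * (1.0 - r_target))


def half_mean(row, cols):
    """Mean of the observed (non-None) scores of one learner on the given items."""
    vals = [row[c] for c in cols if row[c] is not None]
    return sum(vals) / len(vals) if vals else None


def split_half_reliability(scores, seed=0):
    """Classical split-half reliability with Spearman-Brown correction.

    scores: list of rows (one per learner), each a list of item scores,
            with None marking a missing response (an assumption of this sketch).
    """
    rng = random.Random(seed)
    cols = list(range(len(scores[0])))
    rng.shuffle(cols)  # random assignment of items to the two halves
    half_a, half_b = cols[: len(cols) // 2], cols[len(cols) // 2:]
    pairs = [(half_mean(row, half_a), half_mean(row, half_b)) for row in scores]
    # Keep only learners with at least one observed item in each half.
    pairs = [(a, b) for a, b in pairs if a is not None and b is not None]
    r_half = pearson([a for a, _ in pairs], [b for _, b in pairs])
    return spearman_brown(r_half)
```

Under the Spearman-Brown assumptions (parallel items), `length_factor` can be used to ask how much longer or shorter a test would need to be to reach a target such as 0.7. Note that this classical estimate ignores score change between pretest and posttest, which is the limitation the article's learning-aware method is designed to address.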

