Large-Scale Assessments for Learning: A Human-Centred AI Approach to Contextualizing Test Performance

Authors

Guo, H., Johnson, M., Ercikan, K., Saldivia, L., & Worthington, M.

DOI:

https://doi.org/10.18608/jla.2024.8007

Keywords:

large-scale assessment, test-taking process profiles, human-in-the-loop, machine learning, deep learning, research paper

Abstract

Large-scale assessments play a key role in education: educators and stakeholders need to know what students know and can do so that they can shape education policies and interventions in teaching and learning. However, a score from the assessment may not be enough. Educators also need to know why students received low scores, how students engaged with the tasks and the assessment, and how students with different levels of skill worked through the assessment. Process data, combined with response data, reflect students’ test-taking processes and can provide educators with such rich information, but manually labelling these complex data is hard to scale for large-scale assessments. Starting from scratch, we leveraged machine learning techniques (including supervised, unsupervised, and active learning) and experimented with a general human-centred AI approach to help subject matter experts efficiently and effectively make sense of big data, including students’ interaction sequences with the digital assessment platform, such as response, timing, and tool-use sequences. The result is a set of process profiles: a holistic view of students’ entire test-taking processes on the assessment, so that performance can be viewed in context. Process profiles may help identify different sources of low performance and generate rich feedback for educators and policy makers. The released National Assessment of Educational Progress (NAEP) Grade 8 mathematics data were used to illustrate our proposed approach.
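
To make the human-in-the-loop idea in the abstract concrete, the sketch below shows one minimal, hypothetical form of the active-learning step: a classifier is trained on a small expert-labelled seed set of process features, and the sequences it is least certain about are routed back to subject matter experts for labelling. The synthetic features, oracle labels, classifier choice, and batch sizes are all illustrative assumptions, not the authors' actual pipeline.

```python
# A minimal, hypothetical active-learning loop with uncertainty sampling.
# Everything here (features, labels, classifier, batch sizes) is an
# illustrative assumption, not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for per-student process features, e.g., summaries of response,
# timing, and tool-use sequences extracted from platform logs.
X_pool = rng.normal(size=(500, 8))
labels = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # oracle stand-in

# Seed set: a few expert-labelled examples from each class.
seed = np.concatenate([np.flatnonzero(labels == 0)[:5],
                       np.flatnonzero(labels == 1)[:5]])
labelled = list(seed)
unlabelled = [i for i in range(len(X_pool)) if i not in set(labelled)]

for _ in range(5):
    clf = LogisticRegression().fit(X_pool[labelled], labels[labelled])
    # Uncertainty sampling: route the sequences whose predicted class
    # probability is closest to 0.5 back to the experts for labelling.
    proba = clf.predict_proba(X_pool[unlabelled])[:, 1]
    query = [unlabelled[i] for i in np.argsort(np.abs(proba - 0.5))[:10]]
    labelled.extend(query)          # experts label the queried sequences
    unlabelled = [i for i in unlabelled if i not in set(query)]

print(f"expert-labelled {len(labelled)} of {len(X_pool)} sequences")
```

In the setting the abstract describes, the labels queried at each round would come from subject matter experts reviewing the selected students' interaction sequences, and the loop would stop once the classifier's predictions were trustworthy enough to label the remaining sequences automatically.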

Published

2024-08-06

How to Cite

Guo, H., Johnson, M., Ercikan, K., Saldivia, L., & Worthington, M. (2024). Large-Scale Assessments for Learning: A Human-Centred AI Approach to Contextualizing Test Performance. Journal of Learning Analytics, 11(2), 229–245. https://doi.org/10.18608/jla.2024.8007

Issue

Vol. 11 No. 2 (2024)

Section

Research Papers