Metrics for Discrete Student Models: Chance Levels, Comparisons, and Use Cases


  • Nigel Bosch, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark Street, Urbana, IL 61801
  • Luc Paquette, Department of Curriculum & Instruction, University of Illinois at Urbana-Champaign, 1310 S Sixth Street, Champaign, IL 61820



Keywords: metrics, chance level, F1, Cohen's kappa, precision, recall, discrete student models, imbalanced data, imbalanced predictions


Metrics including Cohen’s kappa, precision, recall, and F1 are common measures of performance for models of discrete student states, such as a student’s affect or behaviour. This study examined discrete model metrics for previously published student model examples to identify situations where metrics provided differing perspectives on model performance. Simulated models also systematically showed the effects of imbalanced class distributions in both data and predictions, in terms of the values of metrics and the chance levels (values obtained by making random predictions) for those metrics. Random chance level for F1 was also established and evaluated. Results for example student models showed that over-prediction of the class of interest (positive class) was relatively common. Chance-level F1 was inflated by over-prediction; conversely, maximum possible values for F1 and kappa were negatively impacted by over-prediction of the positive class. Additionally, normalization methods for F1 relative to chance are discussed and compared to kappa, demonstrating an equivalence between kappa and normalized F1. Finally, implications of results for choice of metrics are discussed in the context of common student modelling goals, such as avoiding false negatives for student states that are negatively related to learning.
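
The effect of over-prediction on chance-level F1 can be illustrated with a short sketch. It assumes that "random predictions" means predictions made independently of the true labels, so that expected precision equals the positive base rate and expected recall equals the proportion of instances predicted positive. The function names and example rates are illustrative assumptions, not taken from the article.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and Cohen's kappa from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Observed agreement and agreement expected from the marginals alone.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = ((p_observed - p_expected) / (1 - p_expected)
             if p_expected != 1 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


def chance_f1(base_rate, predicted_rate):
    """Expected F1 when predictions are independent of the true labels.

    Assumption: precision -> base_rate and recall -> predicted_rate,
    so F1 is their harmonic mean.
    """
    if base_rate + predicted_rate == 0:
        return 0.0
    return 2 * base_rate * predicted_rate / (base_rate + predicted_rate)


if __name__ == "__main__":
    base_rate = 0.2  # hypothetical proportion of positive instances
    for predicted_rate in (0.2, 0.6):
        print(predicted_rate, round(chance_f1(base_rate, predicted_rate), 3))

    # Chance-level confusion matrix for 1000 instances, base rate 0.2,
    # predicted-positive rate 0.6 (cells are products of the marginals).
    print(confusion_metrics(tp=120, fp=480, fn=80, tn=320))
```

Under these assumptions, with a base rate of 0.2, raising the predicted-positive rate from 0.2 to 0.6 raises chance-level F1 from 0.2 to 0.3, while kappa for the same chance-level confusion matrix remains 0. This is one way to see why raw F1 values need to be read against their chance level when predictions are imbalanced.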






How to Cite

Bosch, N., & Paquette, L. (2018). Metrics for Discrete Student Models: Chance Levels, Comparisons, and Use Cases. Journal of Learning Analytics, 5(2), 86–104.



Special Section: Methodological Choices in Learning Analytics