Using Instruction-Embedded Formative Assessment to Predict State Summative Test Scores and Achievement Levels in Mathematics


  • Guoguo Zheng University of Georgia
  • Stephen Edward Fancsali Carnegie Learning, Inc.
  • Steven Ritter Carnegie Learning, Inc.
  • Susan Berman Carnegie Learning, Inc.



Intelligent Tutoring Systems, Formative Assessment, Mathematics Education, Accountability, Assessment, Predictive Modeling


If we wish to embed assessment for accountability within instruction, we need to better understand the relative contribution of different types of learner data to statistical models that predict scores and discrete achievement levels on assessments used for accountability purposes. The present work scales up and extends predictive models of math test scores and achievement levels from the existing literature, specifying six categories of models that incorporate information about student prior knowledge, socio-demographics, and performance within the MATHia intelligent tutoring system. Linear regression, ordinal logistic regression, and random forest regression and classification models are learned within each category and generalized over a sample of 23,000+ learners in Grades 6, 7, and 8 across three academic years in Miami-Dade County Public Schools. After briefly exploring hierarchical models of these data, we discuss a variety of technical and practical applications, limitations, and open questions related to this work, especially concerning the potential use of instructional platforms like MATHia as a replacement for time-consuming standardized tests.
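The abstract's modeling setup, predicting a continuous summative score from instruction-embedded features and then discretizing it into achievement levels, can be sketched with synthetic data. This is a minimal illustration, not the authors' pipeline: the feature names (prior score, hints requested, workspaces mastered) and the cut scores mapping predicted scores to Levels 1–5 are hypothetical stand-ins for MATHia-derived predictors and state achievement levels.

```python
import numpy as np

# Synthetic learners: hypothetical instruction-embedded features.
rng = np.random.default_rng(0)
n = 200
prior_score = rng.normal(300, 20, n)              # prior-year test score
hints = rng.poisson(5, n).astype(float)           # hints requested in the tutor
mastered = rng.integers(10, 60, n).astype(float)  # workspaces mastered

# Simulated summative score with additive noise.
score = 0.8 * prior_score + 1.5 * mastered - 2.0 * hints + rng.normal(0, 10, n)

# Linear-regression category: ordinary least squares with an intercept.
X = np.column_stack([np.ones(n), prior_score, hints, mastered])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
pred = X @ beta
rmse = float(np.sqrt(np.mean((score - pred) ** 2)))

# Discrete achievement levels via hypothetical cut scores (Levels 1-5).
level = np.digitize(pred, [290, 310, 330, 350]) + 1
```

In the paper's setting, the regression target is the state test scale score and the ordinal target is the achievement level; ordinal logistic regression and random forests would replace the least-squares step, but the feature-to-prediction structure is the same.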




How to Cite

Zheng, G., Fancsali, S. E., Ritter, S., & Berman, S. (2019). Using Instruction-Embedded Formative Assessment to Predict State Summative Test Scores and Achievement Levels in Mathematics. Journal of Learning Analytics, 6(2), 153–174.