Embedding Experiments: Staking Causal Inference in Authentic Educational Contexts

Joshua R. de Leeuw
Robert L. Goldstone


To identify the ways teachers and educational systems can improve learning, researchers need to make causal inferences. Analyses of existing datasets play an important role in detecting causal patterns, but conducting experiments also plays an indispensable role in this research. In this article, we advocate for experiments to be embedded in real educational contexts, allowing researchers to test whether interventions such as a learning activity, new technology, or advising strategy elicit reliable improvements in authentic student behaviours and educational outcomes. Embedded experiments, wherein theoretically relevant variables are systematically manipulated in real learning contexts, carry strong benefits for making causal inferences, particularly when allied with the data-rich resources of contemporary e-learning environments. Toward this goal, we offer a field guide to embedded experimentation, reviewing experimental design choices, addressing ethical concerns, discussing the importance of involving teachers, and reviewing how interventions can be deployed in a variety of contexts, at a range of scales. Causal inference is a critical component of a field that aims to improve student learning; including experimentation alongside analyses of existing data in learning analytics is the most compelling way to test causal claims.


DOI: https://doi.org/10.18608/jla.2018.52.4
