Qualitative Coding with GPT-4

Where it Works Better

DOI:

https://doi.org/10.18608/jla.2025.8575

Keywords:

qualitative coding, GPT-4, large language model, quantitative ethnography, automated coding, research paper

Abstract

This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, examining which techniques are most successful for different types of constructs. Specifically, we assess three prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviors in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how the effectiveness of different methods varies with a construct’s degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the choice of method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find most difficult to reach inter-rater reliability on.

Published

2025-03-17

How to Cite

Liu, X., Zambrano, A. F., Baker, R. S., Barany, A., Ocumpaugh, J., Zhang, J., Pankiewicz, M., Nasiar, N., & Wei, Z. (2025). Qualitative coding with GPT-4: Where it works better. Journal of Learning Analytics, 12(1), 169–185. https://doi.org/10.18608/jla.2025.8575

Section

Special Section: Generative AI and Learning Analytics