Generative AI and Extension Program Planning and Evaluation: Do General Large Language Models Answer Domain-specific Questions Accurately?

Emmanuel Anobir Mensah; Kirk A. Swortzel

doi:10.5032/jae.v67i1.3289

Authors

Emmanuel Anobir Mensah Mississippi State University https://orcid.org/0000-0002-6096-4747
Kirk A. Swortzel Mississippi State University https://orcid.org/0000-0002-3774-3591

DOI:

https://doi.org/10.5032/jae.v67i1.3289

Keywords:

ChatGPT, Extension Program Planning and Evaluation (EPPE), Accuracy, Gen AI, Cooperative Extension Service

Abstract

Generative artificial intelligence (Gen AI) promises to revolutionize Extension program planning and evaluation (EPPE) by enabling Extension professionals to assess needs and to design and evaluate programs efficiently. However, research has yet to ascertain the accuracy of Gen AI, specifically Large Language Models (LLMs) like ChatGPT, in responding to EPPE-related prompts. This study employed expert judgment to assess the accuracy of ChatGPT responses to EPPE-specific prompts from the perspectives of two EPPE specialists, professors, and faculty who work with the United States Cooperative Extension Service (CES). Analysis using frequencies, percentages, Cohen’s kappa, and inductive content categorization showed that experts rated ChatGPT’s responses “partially correct” for 60% of the prompts, with the remaining responses rated either “correct” or “mixed.” None of the responses were rated “irrelevant,” indicating that ChatGPT’s responses were relevant across all the EPPE topics. However, inter-rater agreement was low (Cohen’s κ = .38, p = .025), revealing variability in expert judgment. Inaccuracies in ChatGPT’s responses resulted from a mismatch with technical evaluation standards and from insufficient contextual information. In conclusion, ChatGPT demonstrates potential as a support tool for EPPE; however, expert oversight, responsible and ethical use, and a chatbot trained on research-based EPPE data could enhance its response accuracy and reliability. We recommend Extension capacity-building initiatives that prepare professionals to use Gen AI responsibly and ethically, examine Gen AI’s responses critically, blend their expertise with AI-generated responses, and write effective prompts (prompt engineering) to enhance Gen AI’s potential utility in EPPE.
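For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal Python sketch of how Cohen's kappa can be computed for two raters who assign categorical labels. It is an illustration only, not the authors' analysis code: the function name `cohens_kappa`, the label strings, and the ten ratings are all hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for ten prompts (NOT the study's data).
rater_a = ["correct", "partial", "partial", "mixed", "partial",
           "correct", "partial", "partial", "mixed", "partial"]
rater_b = ["correct", "partial", "mixed", "mixed", "partial",
           "partial", "partial", "correct", "mixed", "partial"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.5
```

In this made-up example the raters agree on 7 of 10 items (observed agreement .70), while their label frequencies imply a chance agreement of .40, so kappa is (.70 − .40) / (1 − .40) = .50. Kappa can therefore be modest even when raw agreement looks high, because agreement expected by chance is subtracted out.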
