TY - JOUR
T1 - The promise and peril of using a large language model to obtain clinical information
T2 - ChatGPT performs strongly as a fertility counseling tool with limitations
AU - Chervenak, Joseph
AU - Lieman, Harry
AU - Blanco-Breindel, Miranda
AU - Jindal, Sangita
N1 - Publisher Copyright:
© 2023 The Authors
PY - 2023/9
Y1 - 2023/9
N2 - Objective: To compare the responses of the large language model-based “ChatGPT” to reputable sources when given fertility-related clinical prompts. Design: The “Feb 13” version of ChatGPT by OpenAI was tested against established sources relating to patient-oriented clinical information: 17 “frequently asked questions (FAQs)” about infertility on the Centers for Disease Control (CDC) website, 2 validated fertility knowledge surveys, the Cardiff Fertility Knowledge Scale and the Fertility and Infertility Treatment Knowledge Score, as well as the American Society for Reproductive Medicine committee opinion “optimizing natural fertility.” Setting: Academic medical center. Patient(s): Online AI Chatbot. Intervention(s): Frequently asked questions, survey questions and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023. Main Outcome Measure(s): For FAQs from CDC: words/response, sentiment analysis polarity and objectivity, total factual statements, rate of statements that were incorrect, referenced a source, or noted the value of consulting providers. For fertility knowledge surveys: Percentile according to published population data. For Committee Opinion: Whether response to conclusions rephrased as questions identified missing facts. Result(s): When administered the CDC's 17 infertility FAQs, ChatGPT produced responses of similar length (207.8 ChatGPT vs. 181.0 CDC words/response), factual content (8.65 factual statements/response vs. 10.41), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive)), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective)). In total, 9 (6.12%) of 147 ChatGPT factual statements were categorized as incorrect, and only 1 (0.68%) statement cited a reference. ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score. ChatGPT reproduced the missing facts for all 7 summary statements from “optimizing natural fertility.” Conclusion(s): A February 2023 version of “ChatGPT” demonstrates the ability of generative artificial intelligence to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical use.
AB - Objective: To compare the responses of the large language model-based “ChatGPT” to reputable sources when given fertility-related clinical prompts. Design: The “Feb 13” version of ChatGPT by OpenAI was tested against established sources relating to patient-oriented clinical information: 17 “frequently asked questions (FAQs)” about infertility on the Centers for Disease Control (CDC) website, 2 validated fertility knowledge surveys, the Cardiff Fertility Knowledge Scale and the Fertility and Infertility Treatment Knowledge Score, as well as the American Society for Reproductive Medicine committee opinion “optimizing natural fertility.” Setting: Academic medical center. Patient(s): Online AI Chatbot. Intervention(s): Frequently asked questions, survey questions and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023. Main Outcome Measure(s): For FAQs from CDC: words/response, sentiment analysis polarity and objectivity, total factual statements, rate of statements that were incorrect, referenced a source, or noted the value of consulting providers. For fertility knowledge surveys: Percentile according to published population data. For Committee Opinion: Whether response to conclusions rephrased as questions identified missing facts. Result(s): When administered the CDC's 17 infertility FAQs, ChatGPT produced responses of similar length (207.8 ChatGPT vs. 181.0 CDC words/response), factual content (8.65 factual statements/response vs. 10.41), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive)), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective)). In total, 9 (6.12%) of 147 ChatGPT factual statements were categorized as incorrect, and only 1 (0.68%) statement cited a reference. ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score. ChatGPT reproduced the missing facts for all 7 summary statements from “optimizing natural fertility.” Conclusion(s): A February 2023 version of “ChatGPT” demonstrates the ability of generative artificial intelligence to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical use.
KW - Artificial intelligence
KW - counseling
KW - fertility knowledge
KW - natural language processing
KW - online
UR - http://www.scopus.com/inward/record.url?scp=85161630677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85161630677&partnerID=8YFLogxK
U2 - 10.1016/j.fertnstert.2023.05.151
DO - 10.1016/j.fertnstert.2023.05.151
M3 - Article
C2 - 37217092
AN - SCOPUS:85161630677
SN - 0015-0282
VL - 120
SP - 575
EP - 583
JO - Fertility and Sterility
JF - Fertility and Sterility
IS - 3
ER -