TY - JOUR
T1 - The promise and peril of using a large language model to obtain clinical information
T2 - ChatGPT performs strongly as a fertility counseling tool with limitations
AU - Chervenak, Joseph
AU - Lieman, Harry
AU - Blanco-Breindel, Miranda
AU - Jindal, Sangita
N1 - Publisher Copyright:
© 2023 The Authors
PY - 2023/9
Y1 - 2023/9
N2 - Objective: To compare the responses of the large language model-based “ChatGPT” to reputable sources when given fertility-related clinical prompts. Design: The “Feb 13” version of ChatGPT by OpenAI was tested against established sources relating to patient-oriented clinical information: 17 “frequently asked questions (FAQs)” about infertility on the Centers for Disease Control (CDC) website, 2 validated fertility knowledge surveys, the Cardiff Fertility Knowledge Scale and the Fertility and Infertility Treatment Knowledge Score, as well as the American Society for Reproductive Medicine committee opinion “optimizing natural fertility.” Setting: Academic medical center. Patient(s): Online AI Chatbot. Intervention(s): Frequently asked questions, survey questions and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023. Main Outcome Measure(s): For FAQs from CDC: words/response, sentiment analysis polarity and objectivity, total factual statements, rate of statements that were incorrect, referenced a source, or noted the value of consulting providers. For fertility knowledge surveys: Percentile according to published population data. For Committee Opinion: Whether response to conclusions rephrased as questions identified missing facts. Result(s): When administered the CDC's 17 infertility FAQs, ChatGPT produced responses of similar length (207.8 ChatGPT vs. 181.0 CDC words/response), factual content (8.65 factual statements/response vs. 10.41), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive)), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective)). In total, 9 (6.12%) of 147 ChatGPT factual statements were categorized as incorrect, and only 1 (0.68%) statement cited a reference. ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score. ChatGPT reproduced the missing facts for all 7 summary statements from “optimizing natural fertility.” Conclusion(s): A February 2023 version of “ChatGPT” demonstrates the ability of generative artificial intelligence to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical use.
AB - Objective: To compare the responses of the large language model-based “ChatGPT” to reputable sources when given fertility-related clinical prompts. Design: The “Feb 13” version of ChatGPT by OpenAI was tested against established sources relating to patient-oriented clinical information: 17 “frequently asked questions (FAQs)” about infertility on the Centers for Disease Control (CDC) website, 2 validated fertility knowledge surveys, the Cardiff Fertility Knowledge Scale and the Fertility and Infertility Treatment Knowledge Score, as well as the American Society for Reproductive Medicine committee opinion “optimizing natural fertility.” Setting: Academic medical center. Patient(s): Online AI Chatbot. Intervention(s): Frequently asked questions, survey questions and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023. Main Outcome Measure(s): For FAQs from CDC: words/response, sentiment analysis polarity and objectivity, total factual statements, rate of statements that were incorrect, referenced a source, or noted the value of consulting providers. For fertility knowledge surveys: Percentile according to published population data. For Committee Opinion: Whether response to conclusions rephrased as questions identified missing facts. Result(s): When administered the CDC's 17 infertility FAQs, ChatGPT produced responses of similar length (207.8 ChatGPT vs. 181.0 CDC words/response), factual content (8.65 factual statements/response vs. 10.41), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive)), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective)). In total, 9 (6.12%) of 147 ChatGPT factual statements were categorized as incorrect, and only 1 (0.68%) statement cited a reference. ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score. ChatGPT reproduced the missing facts for all 7 summary statements from “optimizing natural fertility.” Conclusion(s): A February 2023 version of “ChatGPT” demonstrates the ability of generative artificial intelligence to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical use.
KW - Artificial intelligence
KW - counseling
KW - fertility knowledge
KW - natural language processing
KW - online
UR - http://www.scopus.com/inward/record.url?scp=85161630677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85161630677&partnerID=8YFLogxK
U2 - 10.1016/j.fertnstert.2023.05.151
DO - 10.1016/j.fertnstert.2023.05.151
M3 - Article
C2 - 37217092
AN - SCOPUS:85161630677
SN - 0015-0282
VL - 120
SP - 575
EP - 583
JO - Fertility and Sterility
JF - Fertility and Sterility
IS - 3
ER -