February 22, 2024

Studies show that AI chatbots provide inconsistent accuracy for musculoskeletal health information

  • Researchers agree: orthopedic surgeons remain the most reliable source of information
  • All chatbots showed significant limitations and omitted crucial steps in the clinical workup
  • Researchers summarize: ChatGPT is not yet an adequate source for answering patient questions; further work is needed to develop an accurate orthopedic-focused chatbot

SAN FRANCISCO, February 12, 2024 /PRNewswire/ -- With the growing popularity of large language model (LLM) chatbots, a type of artificial intelligence (AI) used by ChatGPT, Google Bard and BingAI, it is important to assess the accuracy of the musculoskeletal health information they provide. Three new studies presented at the 2024 Annual Meeting of the American Academy of Orthopedic Surgeons (AAOS) analyzed the validity of the information chatbots give to patients about certain orthopedic procedures and assessed how accurately chatbots present research advances and clinical decision making.

While the studies found that certain chatbots provide concise summaries across a broad spectrum of orthopedic conditions, each showed limited accuracy depending on the category. Researchers agree that orthopedic surgeons remain the most reliable source of information. The findings will help those in the field understand the efficacy of these AI tools, whether their use by patients or non-specialist colleagues could introduce biases or misconceptions, and how future improvements could make chatbots a potentially valuable tool for patients and doctors.

Potential misinformation and dangers associated with the clinical use of LLM chatbots
This research, led by Branden Sosa, a fourth-year medical student at Weill Cornell Medicine, assessed the accuracy of the OpenAI ChatGPT 4.0, Google Bard, and BingAI chatbots in explaining basic orthopedic concepts, integrating clinical information, and answering patient questions. Each chatbot was asked 45 orthopedic-related questions spanning the categories ‘Bone Physiology’, ‘Referring Physician’ and ‘Patient Question’. Two independent, blinded reviewers scored the responses on a scale of 0-4, assessing accuracy, completeness, and usefulness. The responses were analyzed for strengths and weaknesses within categories and between chatbots. The research team noted the following trends:

  • When orthopedic questions were asked, OpenAI ChatGPT, Google Bard, and BingAI provided correct answers covering the most critical salient points in 76.7%, 33%, and 16.7% of questions, respectively.
  • When providing clinical management suggestions, all chatbots showed significant limitations by deviating from the standard of care and omitting crucial steps in the workup, such as ordering antibiotics before cultures or failing to include important tests in the diagnostic workup.
  • When asked less complex patient questions, ChatGPT and Google Bard were usually able to provide accurate answers, but often failed to elicit the critical medical history needed to fully answer the question.
  • A careful analysis of the citations provided by the chatbots revealed oversampling of a small number of references and ten erroneous links that either did not work or led to incorrect articles.

Is ChatGPT ready for prime time? Assessing the accuracy of AI in answering common patient questions about arthroplasty
Researchers, led by Jenna A. Bernstein, MD, an orthopedic surgeon at Connecticut Orthopedics, investigated how accurately ChatGPT 4.0 answers patient questions by developing a list of 80 frequently asked patient questions about knee and hip replacements. Each question was posed to ChatGPT twice: first as written, and then with a prompt instructing ChatGPT to answer the patient’s question “as an orthopedic surgeon.” Each surgeon on the team evaluated the accuracy of each set of answers and rated them on a scale of one to four. Agreement between the two surgeons’ evaluations of each set of ChatGPT responses and the relationship between question prompt and response accuracy were assessed using two modes of statistical analysis (Cohen’s Kappa and the Wilcoxon signed-rank test, respectively). The findings include:

  • When assessing the quality of ChatGPT responses, 26% (21 of 80 responses) had an average rating of three (partly accurate, but incomplete) or less when asked without prompting, and 8% (six of 80 responses) had an average rating of less than three when preceded by a prompt. As such, researchers conclude that ChatGPT is not yet an adequate resource for answering patient questions and that further work is needed to develop an accurate orthopedic-focused chatbot.
  • ChatGPT performed significantly better when prompted to answer patient questions “as an orthopedic surgeon,” with 92% accuracy.
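
For readers unfamiliar with the agreement statistic named above, the following is a minimal Python sketch of unweighted Cohen's Kappa for two raters. The function name `cohens_kappa` and the example scores are hypothetical illustrations and are not data or code from the study; the release does not say whether the researchers used a weighted or unweighted form, so the unweighted form is an assumption here. The Wilcoxon signed-rank test is not shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave every item the same single score
    return (observed - expected) / (1 - expected)


# Hypothetical scores on the one-to-four scale (NOT data from the study).
surgeon_1 = [4, 4, 3, 2, 4, 3, 4, 1, 3, 4]
surgeon_2 = [4, 3, 3, 2, 4, 3, 4, 2, 3, 4]
kappa = cohens_kappa(surgeon_1, surgeon_2)

# Shares quoted in the findings, as percentages.
pct_unprompted = 21 * 100 / 80  # 26.25, reported as 26%
pct_prompted = 6 * 100 / 80     # 7.5, reported as 8%
```

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance, which is why it is preferred over a raw percentage of matching scores.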

Can ChatGPT 4.0 be used to answer patient questions about the Latarjet procedure for anterior shoulder instability?
Researchers from the Hospital for Special Surgery in New York, led by Kyle Kunze, MD, assessed the propensity of ChatGPT 4.0 to provide medical information about the Latarjet procedure for patients with anterior shoulder instability. The overall goal of this study was to understand whether this chatbot could demonstrate the potential to serve as a clinical tool and assist both patients and healthcare providers by providing accurate medical information.

To answer this question, the team first conducted a Google search using the query “Latarjet” to find the top ten frequently asked questions (FAQs) and associated resources about the procedure. They then asked ChatGPT to perform the same search for frequently asked questions to identify the chatbot’s questions and resources. Highlights of the findings included:

  • ChatGPT demonstrated the ability to provide a wide range of clinically relevant questions and answers and derived information from academic sources 100% of the time. This is in contrast to Google, which included a small percentage of academic sources combined with information found on personal websites of surgeons and larger medical practices.
  • The most common question category for both ChatGPT and Google was technical details (40%); however, ChatGPT also presented information on risks/complications (30%), recovery timeline (20%), and evaluation of the surgery (10%).

# # #

2024 AAOS Annual Meeting Disclosure Statement

About the AAOS
With more than 39,000 members, the American Academy of Orthopedic Surgeons is the world’s largest medical association of musculoskeletal specialists. The AAOS is the trusted leader in advancing musculoskeletal health. It provides the highest quality and most comprehensive education to help orthopedic surgeons and allied health professionals at all career levels best treat patients in their daily practice. The AAOS is the source for information about bone and joint conditions, treatments and related issues in musculoskeletal health care, and it leads the health care discussion on advancing quality.

Follow the AAOS on Facebook, X, LinkedIn and Instagram.

SOURCE American Academy of Orthopedic Surgeons
