Brown Neurosurgery tests AI on written and oral board exam questions
The Brown Neurosurgery Department recently published two preprints comparing the performance of the artificial intelligence large language models ChatGPT, GPT-4, and Google Bard on neurosurgery written board examinations and an oral board preparatory question bank.
The AI models passed the written exam questions with “flying colors.” Ziya Gokaslan, professor and chair of neurosurgery at the Warren Alpert Medical School and neurosurgeon-in-chief at Rhode Island Hospital (RIH) and The Miriam Hospital, said the models also performed “superbly” when asked more difficult oral exam questions, which require higher-order thinking grounded in clinical experience.
The oral board exam preprint ranks in the 99th percentile of Altmetric Attention Scores, a measure that has tracked over 23 million online research outputs.
“It’s such an exploding story in the world and in medicine,” said Warren Alpert Professor of Neurosurgery Albert Telfeian, director of minimally invasive endoscopic spine surgery at RIH and of pediatric neurosurgery at Hasbro Children’s Hospital.
Study motivation and findings
Rohaid Ali, a fifth-year neurosurgery resident and co-first author of the studies, was studying for his board exam with his Stanford Medical School classmate Ian Connolly, a fourth-year neurosurgery resident at Massachusetts General Hospital. After seeing ChatGPT pass other standardized tests, such as the bar exam, they wanted to see whether it could answer their board exam questions.
Ali, Connolly, and Oliver Tang ’19 MD’23 collaborated on the studies. GPT-4 performed “better than the average human test taker,” while ChatGPT and Google Bard performed at the “level of the average neurosurgery resident who took these mock exams,” Ali said.
“One of the most interesting aspects” of the study was comparing the AI models against one another, as there have been “very few structured head-to-head comparisons of (them) in any field,” said Wael Asaad, associate professor of neurosurgery and neuroscience at Warren Alpert and director of the functional epilepsy and neurosurgery program at RIH. That makes the findings “really exciting beyond just neurosurgery,” he said.
GPT-4 scored 82.6% on higher-order case management scenarios in mock neurosurgery oral board exam questions, outperforming the other LLMs.
Asaad said he had predicted that GPT-4 would outperform ChatGPT, its predecessor, as well as Google Bard. “Google sort of rushed to jump in and … that rush shows in (how Google Bard) doesn’t perform nearly as well.”
These models have limitations: Because they are text-based and cannot perceive images, they performed poorly on higher-order reasoning questions that involved imaging. When answering these questions, they also produced “hallucinations,” confidently presenting fabricated information.
One question showed an image of an arm with a highlighted region and asked which nerve supplied sensation to that area. Ali said Google Bard gave a “completely made up” answer, whereas GPT-4 correctly determined that it could not answer because, as a text-based model, it cannot see the image.
“It’s important to address the viral social media attention that these (models) have gained, which suggests that (they) could be a brain surgeon, but also important to clarify that these models are not yet ready for primetime and should not be considered a replacement for human activities,” Ali said. “As neurosurgeons, we must safely integrate AI models for patient use and actively investigate their blind spots to provide the best care.”
In clinical settings, these models may give neurosurgeons false or irrelevant information, Asaad said. “LLMs don’t perform very well in these real-world scenarios that are more open-ended and less clear cut,” he said.
Medical AI ethics
Some of the AI models’ correct responses surprised the researchers.
For a question about a severe gunshot wound to the head, the correct answer was that surgery would likely not change the course of the disease. “Fascinatingly, these AI chatbots chose that answer,” Ali said.
“That’s something we didn’t expect (and it’s) worth considering,” Ali said, raising the question of what would happen if these AI models offered guidance on such ethical decisions.
Another concern is that these models are trained on data from clinical trials that underrepresented disadvantaged communities. “We must be vigilant about potential risks of propagating health disparities and address these biases… to prevent harmful recommendations,” Ali said.
“It’s not something that’s unique to those systems—a lot of humans have bias—so it’s just a matter of trying to understand that bias and engineer it out of the system,” Asaad said.
Telfeian stressed the importance of doctor-patient relationships that AI models lack. “If your doctor established some common ground with you—to say ‘oh, you’re from here, or you went to this school’—then suddenly you’re more willing to accept what they would recommend,” he said.
“Taking the surgeon out of the equation is not in the foreseeable future,” said Curt Doberstein, professor of neurosurgery at Warren Alpert and director of cerebrovascular surgery at RIH. “AI can help patients and doctors, but it doesn’t have many capabilities yet.”
Medical AI future
Asaad anticipated that AI models in medicine would “slowly dial back the human factor, and anybody who doesn’t see it that way, who thinks that there’s something magical about what humans do… is missing the deeper picture of what it means to have intelligence.”
“Intelligence isn’t magic. It’s just a process we’re learning to replicate in artificial systems,” Asaad said.
Asaad also envisions AI helping doctors.
He added that because medicine is changing so quickly, clinicians cannot keep up with all the new advancements that might help them evaluate cases. AI models could “give you ideas or resources that are relevant to your clinical problem.”
According to Doberstein, AI-assisted patient documentation and communication could reduce provider burnout, improve patient safety, and strengthen doctor-patient interactions.
“There’s no question that these systems will find their way into medicine and surgery, and I think they’re going to be extremely helpful, but I think we need to be careful in testing these effectively and using (them) thoughtfully,” Gokaslan said.
“We’re at the tip of the iceberg—these things just came out,” Doberstein said. “Everyone in science will constantly have to learn and adapt to new technology and changes.”
“That’s exciting,” he added.