https://www.emergentmind.com/papers/2405.07960
Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in AI and LLMs promise to profoundly impact clinical care. However, current evaluation schemes rely too heavily on static medical question-answering benchmarks, falling short of the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs on their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of doctor agents, as well as reduced compliance, confidence, and willingness for follow-up consultation in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel on benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used for the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.
Figure: Comparison of doctor language models' accuracy and GPT-4 performance on AgentClinic-MedQA under different conditions.
AgentClinic is an open-source benchmark platform designed to simulate real-world clinical environments, enabling testing and evaluation of language models in medical contexts.
The AgentClinic benchmark utilizes multiple agent roles -- Doctor, Patient, Measurement, and Moderator agents -- to create a dynamic, realistic medical consultation simulation, and incorporates biases to study their effects on diagnostic performance (a sketch of one way such biases could be embedded appears after this overview).
The platform also tests multimodal language models, probing their ability to integrate visual data, such as medical images, with textual interactions, with the broader aim of enhancing AI's capability in healthcare through realistic, dynamic simulations.
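To make the bias mechanism concrete, one simple way to implement it is to append a bias description to an agent's system prompt before the simulation begins. The sketch below is an illustrative assumption: the prompt wording, bias names, and the build_patient_prompt helper are ours, not quoted from the paper.

```python
# Hypothetical sketch of embedding a cognitive bias in a patient agent's
# system prompt. Prompt wording and bias names are illustrative only.

BASE_PATIENT_PROMPT = (
    "You are a patient in a clinic. Answer the doctor's questions "
    "using only your case history and symptoms."
)

BIAS_PROMPTS = {
    "none": "",
    "recency": (
        " You recently read about a frightening disease in the news and "
        "suspect you have it, which colors how you report your symptoms."
    ),
    "self_diagnosis": (
        " You have self-diagnosed online and are skeptical of any "
        "assessment from the doctor that conflicts with it."
    ),
}

def build_patient_prompt(bias: str = "none") -> str:
    """Return the patient system prompt with an optional bias appended."""
    return BASE_PATIENT_PROMPT + BIAS_PROMPTS[bias]

# Example: a recency-biased patient agent would be initialized with
# build_patient_prompt("recency") as its system prompt.
```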
AgentClinic is an open-source multimodal agent benchmark designed to simulate real-world clinical environments using language agents. This benchmark platform introduces unique features like multimodal interactions, bias incorporation, and complex agent roles to create a comprehensive and challenging environment for testing LLMs within a medical context.
AgentClinic utilizes four types of language agents to drive its simulated medical platform:

- Doctor Agent: the model under evaluation, which gathers information through dialogue, orders medical tests, and must ultimately commit to a diagnosis.
- Patient Agent: simulates a patient with a hidden diagnosis, answering the doctor's questions with symptoms and history, optionally with embedded biases.
- Measurement Agent: returns the results of the medical tests the doctor requests.
- Moderator Agent: judges whether the doctor's final diagnosis matches the ground truth.

A minimal structural sketch of how these roles could interact follows the list.
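The turn-limited loop below is only a sketch: the class, method, and message-tag names (Agent, ask, "DIAGNOSIS READY", "REQUEST TEST") are illustrative assumptions rather than the benchmark's actual API.

```python
# Structural sketch of a four-agent consultation loop. All names here are
# hypothetical stand-ins; each agent would wrap an LLM call with a
# role-specific system prompt.

class Agent:
    """Stand-in for an LLM-backed agent; `ask` would call the model."""
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt

    def ask(self, dialogue: list[str]) -> str:
        raise NotImplementedError("Replace with an actual LLM call.")

def run_consultation(doctor: Agent, patient: Agent, measurement: Agent,
                     moderator: Agent, true_diagnosis: str,
                     max_turns: int = 20) -> bool:
    """Simulate one case; return True if the doctor's diagnosis is correct."""
    dialogue: list[str] = []
    for _ in range(max_turns):
        doctor_msg = doctor.ask(dialogue)
        dialogue.append(f"Doctor: {doctor_msg}")

        if "DIAGNOSIS READY" in doctor_msg:
            # Moderator judges the proposed diagnosis against ground truth.
            verdict = moderator.ask([doctor_msg, f"Truth: {true_diagnosis}"])
            return verdict.strip().lower() == "correct"

        if "REQUEST TEST" in doctor_msg:
            # Measurement agent reports results for the ordered test.
            dialogue.append(f"Measurement: {measurement.ask(dialogue)}")
        else:
            # Patient replies from scripted symptoms and history.
            dialogue.append(f"Patient: {patient.ask(dialogue)}")

    return False  # Turn budget exhausted without a committed diagnosis.
```

The max_turns cap in the sketch mirrors the paper's finding that both too few and too many interactions reduce the doctor agent's diagnostic accuracy.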
This diversity in agent roles allows AgentClinic to mimic the flow of real medical consultations more closely than previous benchmarks, which relied mainly on static Q&A formats.