Abstract
The rapid advancement of Large Language Models (LLMs) has spurred interest in multi-agent collaboration for tackling complex medical tasks. However, the practical advantages of multi-agent collaboration approaches are not yet fully understood, as existing evaluations often lack generalizability across diverse clinical applications and omit rigorous comparisons against both single LLMs and well-established conventional methods.
To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, utilizing text, medical images, and structured EHR data.
Our extensive experiments reveal a nuanced landscape: while multi-agent frameworks demonstrate benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, they do not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods, which maintain superior performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard provides a vital resource and actionable insights, emphasizing the need for a task-specific, evidence-based approach to selecting and developing AI solutions in medicine, and underscoring that the additional complexity of multi-agent collaboration must be carefully justified against its tangible benefits.
MedAgentBoard Framework
Medical (Visual) Question Answering
This task evaluates the ability of AI systems to answer questions based on medical textual knowledge (QA) or a combination of visual and textual inputs (VQA), such as radiological images or pathology slides.
Datasets
- MedQA
- PubMedQA
- PathVQA
- VQA-RAD
Evaluations
- Accuracy (multiple-choice)
- LLM-as-a-judge scoring (see the judging sketch after this list)
- Clinical relevance
- Factual consistency
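For free-text answers, the LLM-as-a-judge evaluation can be sketched as below. The rubric, the 1-to-5 scale, and the `call_llm` stub are illustrative assumptions, not MedAgentBoard's exact judging protocol.

```python
# LLM-as-a-judge sketch for grading free-text medical answers on
# clinical relevance and factual consistency. Rubric and scale are
# illustrative assumptions, not the benchmark's exact prompt.
import json

JUDGE_PROMPT = """You are a clinical expert grading an AI-generated answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate from 1 (poor) to 5 (excellent) on:
- clinical_relevance: does it address the clinical question?
- factual_consistency: does it agree with the reference and known facts?
Respond with JSON only: {{"clinical_relevance": <int>, "factual_consistency": <int>}}"""

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real chat-completion client.
    return '{"clinical_relevance": 4, "factual_consistency": 5}'

def judge_answer(question: str, reference: str, candidate: str) -> dict:
    """Score one candidate answer on both axes with the judge LLM."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return json.loads(raw)
```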
Methods
- Conventional: BioLinkBERT, GatorTron, M³AE
- Single LLM: Zero-shot, ICL, CoT (prompting sketched below)
- Multi-agent: MedAgents, ReConcile
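A minimal sketch of the zero-shot and chain-of-thought (CoT) prompting strategies on a MedQA-style multiple-choice item. The `call_llm` helper, prompt templates, and answer-extraction regex are illustrative assumptions, not the benchmark's exact implementation.

```python
# Zero-shot vs. chain-of-thought prompting for multiple-choice medical QA.
import re

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real chat-completion client.
    return "Answer: A"

def zero_shot(question: str, options: dict[str, str]) -> str:
    """Ask for the answer letter directly, with no reasoning requested."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    reply = call_llm(f"{question}\n{opts}\nReply as 'Answer: <letter>'.")
    match = re.search(r"Answer:\s*([A-E])", reply)
    return match.group(1) if match else reply.strip()[:1]

def chain_of_thought(question: str, options: dict[str, str]) -> str:
    """Elicit step-by-step reasoning before the final answer letter."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    reply = call_llm(
        f"{question}\n{opts}\n"
        "Let's think step by step, then end with 'Answer: <letter>'."
    )
    match = re.search(r"Answer:\s*([A-E])", reply)
    return match.group(1) if match else reply.strip()[:1]
```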
Lay Summary Generation
Lay summary generation centers on transforming complex medical texts, such as research articles, into versions that are accurate, concise, and readily comprehensible to a broad, non-expert audience.
Datasets
- Cochrane
- eLife and PLOS
- Med-EASi
- PLABA
Evaluations
- ROUGE-L
- SARI (computation of both metrics sketched below)
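Both metrics are available off the shelf; below is a minimal sketch using Hugging Face's `evaluate` package (`pip install evaluate rouge_score`). The source/prediction/reference triple is a toy example; the benchmark's datasets supply the real pairs.

```python
import evaluate

sources = ["Randomized trials show the drug reduced myocardial infarction risk."]
predictions = ["Studies show the drug lowers the chance of a heart attack."]
references = [["The medicine was shown to lower heart attack risk in trials."]]

rouge = evaluate.load("rouge")
sari = evaluate.load("sari")

# ROUGE-L compares predictions against references only.
rouge_scores = rouge.compute(
    predictions=predictions,
    references=[refs[0] for refs in references],
)
# SARI additionally uses the source text to reward simplification edits.
sari_score = sari.compute(
    sources=sources, predictions=predictions, references=references
)
print(rouge_scores["rougeL"], sari_score["sari"])
```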
Methods
- Conventional: T5, PEGASUS, BART (inference sketched below)
- Single LLM: Zero-shot, ICL
- Multi-agent: AgentSimp-inspired
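As a sketch of the conventional route, seq2seq summarization via `transformers`. Note that `facebook/bart-large-cnn` is a generic news-summarization checkpoint used here as a stand-in; the benchmark's baselines are fine-tuned on each lay-summarization dataset.

```python
# Summarization with a pretrained seq2seq model (stand-in checkpoint).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "Background: We conducted a randomized controlled trial of drug X "
    "in 2,000 patients with acute coronary syndrome. ... "
    "Conclusions: Drug X significantly reduced 30-day mortality."
)
# Length bounds are illustrative; tune per dataset.
summary = summarizer(abstract, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```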
EHR Predictive Modeling
This task focuses on predicting patient-specific clinical outcomes, such as in-hospital mortality and 30-day hospital readmission, from structured EHR data.
Datasets
- MIMIC-IV
- Tongji Hospital (TJH)
Evaluations
- AUROC
- AUPRC
Methods
- Conventional: DT, XGBoost, GRU, LSTM (XGBoost baseline sketched below)
- Single LLM: Zero-shot prompting
- Multi-agent: QA-like frameworks
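A sketch of a conventional tabular baseline scored with both metrics. The synthetic features and label stand in for MIMIC-IV/TJH data, which require credentialed access; hyperparameters are illustrative.

```python
# Gradient-boosted trees on tabular EHR-style features, scored with
# AUROC and AUPRC (the latter is more informative under class imbalance).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                         # e.g., labs and vitals
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)   # e.g., mortality label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

print("AUROC:", roc_auc_score(y_te, proba))
print("AUPRC:", average_precision_score(y_te, proba))
```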
Clinical Workflow Automation
Clinical workflow automation evaluates AI systems' ability to handle clinical data analysis tasks, ranging from routine to complex, that traditionally require significant clinical expertise.
Datasets
- MIMIC-IV
- Tongji Hospital (TJH)
- 100 analytical questions
Evaluations
- Expert assessment
- Correctness of data extraction
- Appropriateness of modeling
- Quality of visualization
- Completeness of reports
Methods
- Single LLM
- Multi-agent: SmolAgents, OpenManus, Owl (a generic pipeline is sketched below)
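To make the collaboration pattern concrete, below is a generic staged pipeline (extraction, modeling, visualization, report) with a hypothetical `call_llm` stub. It illustrates the division of labor only; it is not the actual architecture of SmolAgents, OpenManus, or Owl.

```python
# Generic multi-agent pipeline: specialized roles pass artifacts forward.
STAGES = [
    ("extractor", "Write pandas code to extract the cohort for: {task}"),
    ("modeler", "Given this extraction code:\n{prev}\nAdd modeling code for: {task}"),
    ("visualizer", "Given this analysis:\n{prev}\nAdd matplotlib plots for: {task}"),
    ("reporter", "Summarize the full analysis below as a clinical report:\n{prev}"),
]

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real chat-completion client.
    return f"# artifact for prompt of {len(prompt)} chars"

def run_pipeline(task: str) -> dict[str, str]:
    """Run each role in sequence, feeding the prior output forward."""
    artifacts, prev = {}, ""
    for role, template in STAGES:
        prev = call_llm(template.format(task=task, prev=prev))
        artifacts[role] = prev
    return artifacts

outputs = run_pipeline("Compare in-hospital mortality by age group in the cohort.")
print(outputs["reporter"])
```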
Key Findings
Medical QA and VQA Findings
- Medical QA: LLM-based approaches, particularly with advanced prompting, significantly outperform conventional methods. While multi-agent collaboration frameworks are competitive, highly capable single LLMs can achieve SOTA results.
- Medical VQA: Specialized conventional Vision-Language Models remain dominant, likely due to direct fine-tuning on task-specific image-text pairs.
- Single LLM vs. Multi-Agent: The superiority of multi-agent collaboration over well-prompted single LLMs is not consistently decisive. Given the increased overhead, its added value must be carefully justified.
Lay Summary Generation Findings
- Conventional Models Excel: Fine-tuned sequence-to-sequence models consistently achieve high scores on automated metrics across various datasets.
- LLM and Multi-Agent Performance: While single LLMs can produce fluent summaries, they, along with multi-agent frameworks, do not consistently surpass fine-tuned conventional models on automated metrics.
- Multi-agent Shows Limited Gains: The tested multi-agent approach did not demonstrate a clear advantage over well-prompted single LLMs or leading conventional methods.
EHR Predictive Modeling Findings
- Conventional Methods Reign Supreme: Specialized conventional models (sequence-based DL, ensemble methods) exhibit significantly superior predictive performance.
- Advanced LLMs Show Potential but Lag: SOTA LLMs demonstrate notable zero-shot capabilities but do not match trained conventional models.
- Multi-Agent Offers Limited Gains: Multi-agent systems generally improve over their base LLM but do not consistently outperform the best single LLMs or conventional methods.
Clinical Workflow Automation Findings
- Multi-agent Systems Improve Completeness: Frameworks like SmolAgents and OpenManus generally achieve higher rates of task completion, particularly in generating modeling code, visualizations, and reports.
- Overall Correctness Remains Low: Despite improvements in completeness, the rate of "Correct" end-to-end solutions is modest across all methods.
- Data Extraction Most Successfully Automated: Basic data manipulation tasks see higher success rates, with performance degrading for more complex workflow stages.
Resources
Citation
```bibtex
@article{zhu2025medagentboard,
  title={Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks},
  author={Zhu, Yinghao and He, Ziyi and Hu, Haoran and Zheng, Xiaochen and Zhang, Xichen and Wang, Zixiang and Liao, Weibin and Gao, Junyi and Ma, Liantao and Yu, Lequan},
  journal={arXiv preprint},
  year={2025}
  % TODO: Add arXiv ID, e.g., archivePrefix={arXiv}, eprint={YOUR_ARXIV_ID}
}
```