Abstract
The rapid advancement of Large Language Models (LLMs) has spurred interest in multi-agent collaboration for tackling complex medical tasks. However, the practical advantages of multi-agent collaboration approaches are not yet fully understood, as existing evaluations often lack generalizability across diverse clinical applications and omit rigorous comparisons against both single LLMs and well-established conventional methods.
To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, utilizing text, medical images, and structured EHR data.
Our extensive experiments reveal a nuanced landscape: while multi-agent frameworks demonstrate benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, they do not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods, which maintain superior performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard provides a vital resource and actionable insights, emphasizing the need for a task-specific, evidence-based approach to selecting and developing AI solutions in medicine, and underscoring that the additional complexity of multi-agent collaboration must be carefully justified against its tangible benefits.
MedAgentBoard Framework
Medical (Visual) Question Answering
This task evaluates the ability of AI systems to answer questions based on medical textual knowledge (QA) or a combination of visual and textual inputs (VQA), such as radiological images or pathology slides.
Datasets
- MedQA
- PubMedQA
- PathVQA
- VQA-RAD
Evaluations
- Accuracy (multiple-choice)
- LLM-as-a-judge scoring (see the judging sketch after this list)
- Clinical relevance
- Factual consistency
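For free-text answers, the LLM-as-a-judge evaluation can be sketched as below. The rubric, the 1-to-5 scale, and the `call_llm` stub are illustrative assumptions, not MedAgentBoard's exact judging protocol.

```python
# LLM-as-a-judge sketch for grading free-text medical answers on
# clinical relevance and factual consistency. Rubric and scale are
# illustrative assumptions, not the benchmark's exact prompt.
import json

JUDGE_PROMPT = """You are a clinical expert grading an AI-generated answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate from 1 (poor) to 5 (excellent) on:
- clinical_relevance: does it address the clinical question?
- factual_consistency: does it agree with the reference and known facts?
Respond with JSON only: {{"clinical_relevance": <int>, "factual_consistency": <int>}}"""

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real chat-completion client.
    return '{"clinical_relevance": 4, "factual_consistency": 5}'

def judge_answer(question: str, reference: str, candidate: str) -> dict:
    """Score one candidate answer on both axes with the judge LLM."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return json.loads(raw)
```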
Methods
- Conventional: BioLinkBERT, GatorTron, M³AE
- Single LLM: Zero-shot, ICL, CoT (prompting sketched below)
- Multi-agent: MedAgents, ReConcile
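A minimal sketch of the zero-shot and chain-of-thought (CoT) prompting strategies on a MedQA-style multiple-choice item. The `call_llm` helper, prompt templates, and answer-extraction regex are illustrative assumptions, not the benchmark's exact implementation.

```python
# Zero-shot vs. chain-of-thought prompting for multiple-choice medical QA.
import re

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real chat-completion client.
    return "Answer: A"

def zero_shot(question: str, options: dict[str, str]) -> str:
    """Ask for the answer letter directly, with no reasoning requested."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    reply = call_llm(f"{question}\n{opts}\nReply as 'Answer: <letter>'.")
    match = re.search(r"Answer:\s*([A-E])", reply)
    return match.group(1) if match else reply.strip()[:1]

def chain_of_thought(question: str, options: dict[str, str]) -> str:
    """Elicit step-by-step reasoning before the final answer letter."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    reply = call_llm(
        f"{question}\n{opts}\n"
        "Let's think step by step, then end with 'Answer: <letter>'."
    )
    match = re.search(r"Answer:\s*([A-E])", reply)
    return match.group(1) if match else reply.strip()[:1]
```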
Lay Summary Generation
Lay summary generation centers on transforming complex medical texts, such as research articles, into versions that are accurate, concise, and readily comprehensible to a broad, non-expert audience.
Datasets
- Cochrane
- eLife and PLOS
- Med-EASi
- PLABA
Evaluations
- ROUGE-L
- SARI (computation of both metrics sketched below)
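Both metrics are available off the shelf; below is a minimal sketch using Hugging Face's `evaluate` package (`pip install evaluate rouge_score`). The source/prediction/reference triple is a toy example; the benchmark's datasets supply the real pairs.

```python
import evaluate

sources = ["Randomized trials show the drug reduced myocardial infarction risk."]
predictions = ["Studies show the drug lowers the chance of a heart attack."]
references = [["The medicine was shown to lower heart attack risk in trials."]]

rouge = evaluate.load("rouge")
sari = evaluate.load("sari")

# ROUGE-L compares predictions against references only.
rouge_scores = rouge.compute(
    predictions=predictions,
    references=[refs[0] for refs in references],
)
# SARI additionally uses the source text to reward simplification edits.
sari_score = sari.compute(
    sources=sources, predictions=predictions, references=references
)
print(rouge_scores["rougeL"], sari_score["sari"])
```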
Methods
- Conventional: T5, PEGASUS, BART (inference sketched below)
- Single LLM: Zero-shot, ICL
- Multi-agent: AgentSimp-inspired
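As a sketch of the conventional route, seq2seq summarization via `transformers`. Note that `facebook/bart-large-cnn` is a generic news-summarization checkpoint used here as a stand-in; the benchmark's baselines are fine-tuned on each lay-summarization dataset.

```python
# Summarization with a pretrained seq2seq model (stand-in checkpoint).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "Background: We conducted a randomized controlled trial of drug X "
    "in 2,000 patients with acute coronary syndrome. ... "
    "Conclusions: Drug X significantly reduced 30-day mortality."
)
# Length bounds are illustrative; tune per dataset.
summary = summarizer(abstract, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```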
EHR Predictive Modeling
This task focuses on predicting patient-specific clinical outcomes, such as in-hospital mortality and 30-day hospital readmission, from structured EHR data.
Datasets
- MIMIC-IV
- Tongji Hospital (TJH)
Evaluations
- AUROC
- AUPRC
Methods
- Conventional: DT, XGBoost, GRU, LSTM (XGBoost baseline sketched below)
- Single LLM: Zero-shot prompting
- Multi-agent: QA-like frameworks
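A sketch of a conventional tabular baseline scored with both metrics. The synthetic features and label stand in for MIMIC-IV/TJH data, which require credentialed access; hyperparameters are illustrative.

```python
# Gradient-boosted trees on tabular EHR-style features, scored with
# AUROC and AUPRC (the latter is more informative under class imbalance).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                         # e.g., labs and vitals
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)   # e.g., mortality label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

print("AUROC:", roc_auc_score(y_te, proba))
print("AUPRC:", average_precision_score(y_te, proba))
```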
Clinical Workflow Automation
Clinical workflow automation evaluates AI systems' ability to handle clinical data analysis tasks, ranging from routine to complex, that traditionally require significant clinical expertise.
Datasets
- MIMIC-IV
- Tongji Hospital (TJH)
- 100 analytical questions
Evaluations
- Expert assessment
- Correctness of data extraction
- Appropriateness of modeling
- Quality of visualization
- Completeness of reports
Methods
- Single LLM
- Multi-agent: SmolAgents, OpenManus, Owl (a generic pipeline is sketched below)
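To make the collaboration pattern concrete, below is a generic staged pipeline (extraction, modeling, visualization, report) with a hypothetical `call_llm` stub. It illustrates the division of labor only; it is not the actual architecture of SmolAgents, OpenManus, or Owl.

```python
# Generic multi-agent pipeline: specialized roles pass artifacts forward.
STAGES = [
    ("extractor", "Write pandas code to extract the cohort for: {task}"),
    ("modeler", "Given this extraction code:\n{prev}\nAdd modeling code for: {task}"),
    ("visualizer", "Given this analysis:\n{prev}\nAdd matplotlib plots for: {task}"),
    ("reporter", "Summarize the full analysis below as a clinical report:\n{prev}"),
]

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real chat-completion client.
    return f"# artifact for prompt of {len(prompt)} chars"

def run_pipeline(task: str) -> dict[str, str]:
    """Run each role in sequence, feeding the prior output forward."""
    artifacts, prev = {}, ""
    for role, template in STAGES:
        prev = call_llm(template.format(task=task, prev=prev))
        artifacts[role] = prev
    return artifacts

outputs = run_pipeline("Compare in-hospital mortality by age group in the cohort.")
print(outputs["reporter"])
```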
Key Findings
Medical QA and VQA Findings
- Medical QA: LLM-based approaches, particularly with advanced prompting, significantly outperform conventional methods. While multi-agent collaboration frameworks are competitive, highly capable single LLMs can achieve SOTA results.
- Medical VQA: Specialized conventional Vision-Language Models remain dominant, likely due to direct fine-tuning on task-specific image-text pairs.
- Single LLM vs. Multi-Agent: The superiority of multi-agent collaboration over well-prompted single LLMs is not consistently decisive. Given the increased overhead, its added value must be carefully justified.
Lay Summary Generation Findings
- Conventional Models Excel: Fine-tuned sequence-to-sequence models consistently achieve high scores on automated metrics across various datasets.
- LLM and Multi-Agent Performance: While single LLMs can produce fluent summaries, they, along with multi-agent frameworks, do not consistently surpass fine-tuned conventional models on automated metrics.
- Multi-agent Shows Limited Gains: The tested multi-agent approach did not demonstrate a clear advantage over well-prompted single LLMs or leading conventional methods.
EHR Predictive Modeling Findings
- Conventional Methods Reign Supreme: Specialized conventional models (sequence-based DL, ensemble methods) exhibit significantly superior predictive performance.
- Advanced LLMs Show Potential but Lag: SOTA LLMs demonstrate notable zero-shot capabilities but do not match trained conventional models.
- Multi-Agent Offers Limited Gains: Multi-agent systems generally improve over their base LLM but do not consistently outperform the best single LLMs or conventional methods.
Clinical Workflow Automation Findings
- Multi-agent Systems Improve Completeness: Frameworks like SmolAgents and OpenManus generally achieve higher rates of task completion, particularly in generating modeling code, visualizations, and reports.
- Overall Correctness Remains Low: Despite improvements in completeness, the rate of "Correct" end-to-end solutions is modest across all methods.
- Data Extraction Most Successfully Automated: Basic data manipulation tasks see higher success rates, with performance degrading for more complex workflow stages.
Resources
Citation
```bibtex
@article{zhu2025medagentboard,
  title={Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks},
  author={Zhu, Yinghao and He, Ziyi and Hu, Haoran and Zheng, Xiaochen and Zhang, Xichen and Wang, Zixiang and Liao, Weibin and Gao, Junyi and Ma, Liantao and Yu, Lequan},
  journal={arXiv preprint},
  year={2025}
  % TODO: Add arXiv ID, e.g., archivePrefix={arXiv}, eprint={YOUR_ARXIV_ID}
}
```