[Paper Review] The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey (2024.04)

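WildBench: Built on the WildChat dataset (570,000 real ChatGPT conversations). It includes diverse tasks and prompts, so it covers a broad range of topics.
SWE-Bench: A Python software engineering benchmark built from GitHub issues. It is of limited use for evaluating languages other than Python or general problem-solving ability.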

2025. 4. 18. 19:49 · Paper Review/Agent


Introduction