| 9.5 |
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems |
Multi-Device Agent, Hierarchical Replanning, Agent Recovery |
| 9.5 |
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems |
agent safety, multi-turn red-teaming, jailbreak benchmark |
| 9.5 |
ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments |
Academic Search Agent, Agent Evaluation Benchmark, Information Retrieval Agent |
| 9.5 |
Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services |
Agent Tool Use, Agent Planning, Agent Execution Optimization |
| 9.5 |
MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments |
Multi-Agent Collaboration, Adversarial Training, Deep Research Agent |
| 9.5 |
Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience |
Multi-Agent Debate, Adversarial Robustness, Heterogeneous LLMs |
| 9.5 |
SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design |
Multi-Agent Systems, Skill Composition, Graph Neural Networks |
| 9.0 |
LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents |
Tool-Use Agent, State Management, Policy Compliance |
| 9.0 |
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution |
Agent Self-Evolution, Memory-Driven, Marginal Advantage Accumulation |
| 9.0 |
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization |
Multi-Agent Collaboration, Medical AI Agent, Recursive Reasoning |
| 9.0 |
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning [Code] |
LLM Agent, Reinforcement Learning, Long-Lifecycle Agent |
| 9.0 |
Multi-Agent Transactive Memory |
Multi-Agent Systems, Memory Sharing, Retrieval-Augmented Generation |
| 9.0 |
AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts |
Agent Memory System, Memory Augmentation, Atomic Fact Extraction |
| 9.0 |
ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End? |
LLM Agent, Operations Research, Benchmark |
| 9.0 |
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents |
Agent Evaluation, Predictive Validity, Benchmark Design |
| 8.5 |
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems |
Multi-Agent Systems, Evaluator Bias Propagation, LLM Evaluators |
| 8.5 |
UltraQuant: 4-bit KV Caching for Context-Heavy Agents |
LLM Agent, KV-Cache Compression, 4-bit Quantization |
| 8.5 |
SoftSkill: Behavioral Compression for Contextual Adaptation |
Agent skill compression, Soft prompt tuning, Frozen backbone agent |
| 8.5 |
ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research |
LLM Agent, Deep Research Agent, Multi-Round Retrieval |
| 8.5 |
When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents [Code] |
Agent Safety, Tool Use, Privilege Minimization |
| 8.5 |
AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA |
Multi-Agent pipeline, Financial chart QA, Auditability |
| 8.0 |
Probe-and-Refine Tuning of Repository Guidance for Coding Agents |
Coding Agent, Repository Guidance, Probe-and-Refine Tuning |
| 7.5 |
Efficient and Sound Probabilistic Verification for AI Agents |
Agent Safety, Runtime Monitoring, Probabilistic Verification |
| 7.5 |
Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs |
Multi-Agent Systems, Code Agent, Software Engineering Agent |
| 7.5 |
N-Version Programming with Coding Agents |
Coding Agent, N-Version Programming, Agent Diversity |
| 7.5 |
VIMPO: Value-Implicit Policy Optimization for LLMs |
LLM Agent Reasoning Enhancement, Reinforcement Learning, Policy Optimization |
| 7.5 |
A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models |
Black-Box UE, Multi-Agent, Uncertainty Estimation |
| 7.5 |
Large Language Models Do Not Always Need Readable Language |
LLM Agent, Cross-Agent Communication, Agent Memory |