
Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

1Zhejiang University, 2Ant Group, 3National University of Singapore
*Equal Contribution, ✉Corresponding Author
ICML 2025

Introduction

As multimodal large language models (MLLMs) advance, Generalist Virtual Agents (GVAs) still rely on outcome-based rewards and labor-intensive manual annotation, leaving them without fine-grained process supervision or inference-time scalability.

We propose Similar, a step-wise, multi-dimensional reward model defining five evaluation dimensions (Helpfulness, Odds of Success, Efficiency, Task Relevance, Coherence), paired with an MCTS-P algorithm to automatically collect cross-platform annotated data.

We also introduce SRM, the first benchmark for step-wise, multi-dimensional reward model evaluation, comprising a 78k-sample training set (SRMTrain) and a 32k-sample test set (SRMEval) spanning Web, Android, Linux, and Windows.

Experiments show that Similar achieves 61.2% average accuracy on SRMEval, outperforming baselines by 13.2%, and boosts task success rates by up to 35.9% during inference, demonstrating its effectiveness in guiding GVA training and scaling.

Similar

Overview

GVAs are designed to process multimodal inputs (e.g., UI elements, text, visuals) and navigate digital environments to perform tasks. However, traditional training methods rely heavily on outcome-based rewards, which lack the granularity needed to guide agents effectively through complex tasks. To overcome these challenges, we propose Similar, a reward model that provides multi-dimensional, step-wise supervision signals to guide agent learning and reasoning.
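To make the step-wise, multi-dimensional signal concrete, here is a minimal Python sketch of scoring a single agent step along the five dimensions. The `Step` container and the `reward_model.score(...)` interface are illustrative assumptions for exposition, not the released API.

from dataclasses import dataclass
from typing import Dict, List

# Illustrative container for one agent step: the task instruction, the
# interaction history so far, and the candidate action (e.g., a click or keystroke).
@dataclass
class Step:
    instruction: str
    history: List[str]
    action: str

# The five evaluation dimensions defined by Similar.
DIMENSIONS = ["Helpfulness", "Odds of Success", "Efficiency",
              "Task Relevance", "Coherence"]

def score_step(reward_model, step: Step) -> Dict[str, float]:
    """Return one scalar score per dimension for a single step.
    `reward_model.score(...)` is a placeholder call standing in for an
    MLLM-based reward model; it is not an actual API of this project."""
    return {
        dim: reward_model.score(step.instruction, step.history,
                                step.action, dimension=dim)
        for dim in DIMENSIONS
    }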

Traditional coarse-grained, outcome-based, labor-intensive paradigm vs. our fine-grained, process-based, autonomous paradigm.
Similar model training pipeline. First, we systematically define five dimensions to describe the quality of an agent’s step. Next, we propose an MCTS-P algorithm to automatically collect annotated step-wise data. Finally, we design the Triple-M strategy to train the Similar model, which can guide the agent during both the training and inference phases.

Applications

A case of Similar providing guidance for GVA training and inference.

    Similar enhances GVA performance in two key phases:
  • Training Phase: As a reward model in reinforcement learning frameworks, Similar provides fine-grained feedback to optimize agent behavior and improve task performance;
  • Inference Phase: Integrated with search algorithms such as MCTS, Similar enhances reasoning efficiency, steering actions toward successful task completion (see the sketch after this list).
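As a rough illustration of the inference-phase use, the Python sketch below re-ranks candidate actions by an aggregated step reward, reusing the hypothetical `Step`, `score_step`, and `DIMENSIONS` from the Overview sketch. It is a simplified best-of-N selection with assumed uniform dimension weights, not the paper's full MCTS integration.

from typing import Dict, List, Optional

def select_action(reward_model, candidates: List[Step],
                  weights: Optional[Dict[str, float]] = None) -> Step:
    """Pick the candidate step with the highest weighted sum of
    per-dimension rewards. A stand-in for search-guided inference;
    the actual method plugs the reward model into MCTS node evaluation."""
    weights = weights or {dim: 1.0 / len(DIMENSIONS) for dim in DIMENSIONS}

    def aggregate(step: Step) -> float:
        scores = score_step(reward_model, step)
        return sum(weights[dim] * scores[dim] for dim in DIMENSIONS)

    return max(candidates, key=aggregate)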

Contributions


(1) Five Key Dimensions: We define five dimensions for step-wise GVA assessment and introduce an automated framework using MCTS-P to collect fine-grained, cross-platform annotations for reward model training.
(2) Triple-M Strategy: We propose a Triple-M strategy to train Similar, integrating multiple dimensions and generating synergistic gains for robust, fine-grained feedback.
(3) SRM Benchmark: We introduce SRMEval, a multi-step, multi-dimensional benchmark for evaluating reward models, advancing research in reward model performance assessment.
(4) Superior Performance: Our approach achieves superior GVA performance across diverse tasks and environments, demonstrating the effectiveness of step-wise multi-dimensional assessment and synergistic expert integration.

SRMEval

Overview

Since reward models are crucial for enhancing GVA performance, and prior research has not focused on evaluating reward models themselves, we propose SRMEval, the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation.

We propose a new task for reward models in the virtual agent domain: selecting the better action (i.e., the chosen action) from two candidate actions at step i under a specific evaluation dimension. The evaluation metric is Accuracy, which measures the reward model's ability to select the better action and is computed per evaluation type.
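As a concrete reading of this metric, the sketch below computes accuracy for one dimension over (chosen, rejected) action pairs, reusing the hypothetical `Step` and `score_step` from the Overview sketch; the pair structure is assumed and does not reflect the benchmark's actual data schema.

from typing import Iterable, Tuple

def pairwise_accuracy(reward_model,
                      pairs: Iterable[Tuple[Step, Step]],
                      dimension: str) -> float:
    """Fraction of pairs where the model scores the annotated 'chosen'
    action above the 'rejected' one under the given dimension."""
    pairs = list(pairs)
    correct = sum(
        score_step(reward_model, chosen)[dimension]
        > score_step(reward_model, rejected)[dimension]
        for chosen, rejected in pairs
    )
    return correct / len(pairs)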

Benchmark Cases

Cases of SRMEval.

BibTeX


@misc{miao2025boostingvirtualagentlearning,
      title={Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark},
      author={Bingchen Miao and Yang Wu and Minghe Gao and Qifan Yu and Wendong Bu and Wenqiao Zhang and Yunfei Li and Siliang Tang and Tat-Seng Chua and Juncheng Li},
      year={2025},
      eprint={2503.18665},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18665},
}