
Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

1Zhejiang University, 2Ant Group, 3National University of Singapore
*Equal Contribution, ✉Corresponding Author
ICML 2025

Introduction

As multimodal large language models (MLLMs) advance, Generalist Virtual Agents (GVAs) still rely on outcome-based rewards and labor-intensive manual annotation, leaving them without fine-grained process supervision or inference-time scalability.

We propose Similar, a step-wise, multi-dimensional reward model defining five evaluation dimensions (Helpfulness, Odds of Success, Efficiency, Task Relevance, Coherence), paired with an MCTS-P algorithm to automatically collect cross-platform annotated data.

We also introduce SRM, the first benchmark for step-wise, multi-dimensional reward model evaluation, comprising a 78k-sample training set (SRMTrain) and a 32k-sample test set (SRMEval) spanning Web, Android, Linux, and Windows.

Experiments show that Similar achieves 61.2% average accuracy on SRMEval, outperforming baselines by 13.2%, and boosts task success rates by up to 35.9% during inference, demonstrating its effectiveness in guiding GVA training and scaling.

Similar

Overview

GVAs are designed to process multimodal inputs (e.g., UI elements, text, visuals) and navigate digital environments to perform tasks. However, traditional training methods rely heavily on outcome-based rewards, which lack the granularity needed to guide agents effectively through complex tasks. To overcome these challenges, we propose Similar, a reward model that provides multi-dimensional, step-wise supervision signals to guide agent learning and reasoning.
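To make the step-wise, multi-dimensional signal concrete, here is a minimal Python sketch of scoring a single agent step along the five dimensions. The `Step` container and the `reward_model.score(...)` interface are illustrative assumptions for exposition, not the released API.

from dataclasses import dataclass
from typing import Dict, List

# Illustrative container for one agent step: the task instruction, the
# interaction history so far, and the candidate action (e.g., a click or keystroke).
@dataclass
class Step:
    instruction: str
    history: List[str]
    action: str

# The five evaluation dimensions defined by Similar.
DIMENSIONS = ["Helpfulness", "Odds of Success", "Efficiency",
              "Task Relevance", "Coherence"]

def score_step(reward_model, step: Step) -> Dict[str, float]:
    """Return one scalar score per dimension for a single step.
    `reward_model.score(...)` is a placeholder call standing in for an
    MLLM-based reward model; it is not an actual API of this project."""
    return {
        dim: reward_model.score(step.instruction, step.history,
                                step.action, dimension=dim)
        for dim in DIMENSIONS
    }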

Traditional coarse-grained, outcome-based, labor-intensive paradigm vs. our fine-grained, process-based, autonomous paradigm.
Similar model training pipeline. First, we systematically define five dimensions to describe the quality of an agent’s step. Next, we propose an MCTS-P algorithm to automatically collect annotated step-wise data. Finally, we design the Triple-M strategy to train the Similar model, which can guide the agent during both the training and inference phases.

Applications

A case of Similar providing guidance for GVA training and inference.

    Similar enhances GVA performance in two key phases:
  • Training Phase: As a reward model in reinforcement learning frameworks, Similar provides fine-grained feedback to optimize agent behavior and improve task performance;
  • Inference Phase: Integrated with search algorithms such as MCTS, Similar enhances reasoning efficiency, steering actions toward successful task completion (see the sketch after this list).
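As a rough illustration of the inference-phase use, the Python sketch below re-ranks candidate actions by an aggregated step reward, reusing the hypothetical `Step`, `score_step`, and `DIMENSIONS` from the Overview sketch. It is a simplified best-of-N selection with assumed uniform dimension weights, not the paper's full MCTS integration.

from typing import Dict, List, Optional

def select_action(reward_model, candidates: List[Step],
                  weights: Optional[Dict[str, float]] = None) -> Step:
    """Pick the candidate step with the highest weighted sum of
    per-dimension rewards. A stand-in for search-guided inference;
    the actual method plugs the reward model into MCTS node evaluation."""
    weights = weights or {dim: 1.0 / len(DIMENSIONS) for dim in DIMENSIONS}

    def aggregate(step: Step) -> float:
        scores = score_step(reward_model, step)
        return sum(weights[dim] * scores[dim] for dim in DIMENSIONS)

    return max(candidates, key=aggregate)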

Contributions


(1) Five Key Dimensions: We define five dimensions for step-wise GVA assessment and introduce an automated framework using MCTS-P to collect fine-grained, cross-platform annotations for reward model training.
(2) Triple-M Strategy: We propose a Triple-M strategy to train Similar, integrating multiple dimensions and generating synergistic gains for robust, fine-grained feedback.
(3) SRM Benchmark: We introduce SRMEval, a multi-step, multi-dimensional benchmark for evaluating reward models, advancing research in reward model performance assessment.
(4) Superior Performance: Our approach achieves superior GVA performance across diverse tasks and environments, demonstrating the effectiveness of step-wise multi-dimensional assessment and synergistic expert integration.

SRMEval

Overview

Since reward models are crucial for enhancing GVA performance, and prior research has not focused on evaluating reward models themselves, we propose SRMEval, the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation.

We propose a new task for reward models in the virtual agent domain: selecting the better action (i.e., the chosen action) from two candidate actions at step i under a specific evaluation dimension. The evaluation metric is Accuracy, which measures the reward model's ability to select the better action and is computed per evaluation type.
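As a concrete reading of this metric, the sketch below computes accuracy for one dimension over (chosen, rejected) action pairs, reusing the hypothetical `Step` and `score_step` from the Overview sketch; the pair structure is assumed and does not reflect the benchmark's actual data schema.

from typing import Iterable, Tuple

def pairwise_accuracy(reward_model,
                      pairs: Iterable[Tuple[Step, Step]],
                      dimension: str) -> float:
    """Fraction of pairs where the model scores the annotated 'chosen'
    action above the 'rejected' one under the given dimension."""
    pairs = list(pairs)
    correct = sum(
        score_step(reward_model, chosen)[dimension]
        > score_step(reward_model, rejected)[dimension]
        for chosen, rejected in pairs
    )
    return correct / len(pairs)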

Benchmark Cases

Cases of SRMEval.

BibTeX


@misc{miao2025boostingvirtualagentlearning,
      title={Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark},
      author={Bingchen Miao and Yang Wu and Minghe Gao and Qifan Yu and Wendong Bu and Wenqiao Zhang and Yunfei Li and Siliang Tang and Tat-Seng Chua and Juncheng Li},
      year={2025},
      eprint={2503.18665},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18665},
}