Expertise Need Not Monopolize:
Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Weijie Shen1,2,8*, Yitian Liu1,3*, Yuhao Wu5,8*, Zhixuan Liang6,4*†,
Sijia Gu7, Dehui Wang1,2,8, Tian Nian1, Lei Xu3,10, Yusen Qin8, Jiangmiao Pang4, Xinping Guan2,9,
Xiaokang Yang1,3†, Yao Mu1,3,4†
*Equal contribution   †Corresponding authors.
1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, 2School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, 3School of Computer Science, Shanghai Jiao Tong University, 4Shanghai AI Laboratory, 5Tsinghua Shenzhen International Graduate School, Tsinghua University, 6The University of Hong Kong, 7Tongji University, 8D-Robotics, 9Key Laboratory of System Control and Information Processing, Ministry of Education of China, 10Shanghai Key Laboratory of Integrated Administration Technologies for Information Security

The AdaMoE-VLA Model

Our AdaMoE-VLA framework builds upon a pretrained flow-matching Vision-Language-Action (VLA) model π0, enhancing its action reasoning capacity through Mixture-of-Experts (MoE) integration. Our pipeline (Fig. a) processes multi-modal inputs through a vision-language model and a transformer-based action expert to predict continuous control. Each transformer block in the action expert incorporates an MoE layer (Fig. b) comprising shared experts, which inherit the original FFN weights to capture general manipulation patterns, and routed experts, dynamically selected by a router network to specialize in action-specific behaviors.
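The shared/routed split described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the module names, hidden sizes, and the use of a single shared expert with top-k softmax routing are assumptions for clarity.

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Sketch of an MoE FFN layer: one shared expert (which would inherit
    the pretrained FFN weights) plus routed experts selected per token by
    a top-k router. Sizes and structure are illustrative."""

    def __init__(self, d_model=16, d_ff=32, n_routed=4, top_k=2):
        super().__init__()
        # Shared expert: captures general manipulation patterns.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Routed experts: specialize in action-specific behaviors.
        self.routed = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                         # x: (batch, tokens, d_model)
        out = self.shared(x)                      # always-on shared expert
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```

For efficiency a real implementation would dispatch tokens to experts rather than evaluate every expert on every token; the dense loop here is kept for readability.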

As shown in Fig. c, the vanilla MoE couples expert selection and weighting through a single router using top-k softmax gating. In contrast, our AdaMoE-VLA (Fig. d) introduces a decoupled architecture with an independent router for expert selection and a scale adapter for adaptive weighting. This design alleviates optimization conflicts between load balancing and task specialization, enabling more flexible and expressive expert collaboration for robust robotic control.

Experiments

Evaluations on simulation benchmarks

We evaluate our method on two simulation benchmarks: (1) four task suites from the LIBERO benchmark: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long; and (2) 19 tasks from RoboTwin 2.0, where each task dataset contains 100 expert trajectories collected in clean environments and 400 collected in domain-randomized environments.

Our results demonstrate clear performance improvements of MoE over dense models, with particularly pronounced gains on large-scale datasets and long-horizon tasks. On the LIBERO benchmark, our AdaMoE-VLA achieves an average improvement of 1.8% over the baseline π0 model (94.2% → 96.0%) across all four task suites, as shown in Table 1. As detailed in Table 2, the improvements are more significant on the large-scale RoboTwin dataset, where we observe a substantial 9.3% performance gain (40.4% → 49.7%) across 19 manipulation tasks with 9500 demonstrations.

Notably, our method excels in both domain-randomized tasks and long-horizon sequential tasks. In domain-randomized scenarios with high environmental and object variation, the diverse expert specialization enables better handling of different lighting conditions, object properties, poses, and manipulation strategies across diverse configurations. The gains on long-horizon tasks are particularly pronounced, with our method achieving a 92% success rate on LIBERO-Long, demonstrating that MoE architectures can effectively decompose complex sequential manipulation into specialized sub-skills handled by different experts.


Meaningful Expert Specialization

Analysis of expert activation patterns reveals clear task-dependent specialization across different manipulation phases. The figure above shows the activation patterns of experts at a fixed layer L during various manipulation tasks, where expert usage intensity measures the proportion of tokens assigned to each expert at each frame. We observe distinct activation patterns that correlate with specific manipulation phases. For the same task, “put both the alphabet soup and the tomato sauce”, all experts show similar token load distributions, as illustrated in subfigures (a) and (b). Furthermore, across different tasks, experts exhibit consistent trends for certain atomic operations. For instance, in subfigures (a), (b), and (c), Expert 3 shows increased token utilization precisely when the policy performs target positioning and gripper release operations. The consistency of activation patterns across similar manipulation phases demonstrates that our experts capture meaningful behavioral primitives rather than arbitrary task divisions.
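The usage-intensity metric described above is straightforward to compute from the router's top-k indices. Below is a hedged sketch; the function name and tensor layout are our assumptions, not the paper's code.

```python
import torch


def expert_usage_intensity(topk_idx, n_experts):
    """Per-frame expert usage intensity: the fraction of token-to-expert
    assignments that each expert receives in a frame.
    topk_idx: integer tensor of shape (frames, tokens, top_k) holding the
    router's selected expert indices (illustrative layout)."""
    frames = topk_idx.shape[0]
    flat = topk_idx.reshape(frames, -1)       # all assignments in each frame
    counts = torch.stack(
        [(flat == e).sum(-1) for e in range(n_experts)], dim=-1)
    return counts.float() / flat.shape[-1]    # each row sums to 1
```

Plotting this quantity over frames for each expert yields curves like those in the activation-pattern figure.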


Effectiveness of Our Decoupled Architecture Design

To validate our decoupled expert selection and weighting mechanism, we conduct comprehensive ablation studies on LIBERO comparing three architectural variants:

  • Vanilla MoE: Traditional MoE with coupled selection and weighting using softmax router outputs
  • Concatenated Scale Adapter MoE (CSMoE): Router outputs and action tokens are concatenated and fed to a scale adapter that directly outputs expert weights
  • Additive Scale Adapter MoE (Our AdaMoE-VLA): Expert weights are computed as the sum of router weights and scale adapter weights
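The three gating variants above can be contrasted in a few lines. This sketch simplifies CSMoE (whose adapter actually consumes the concatenation of router outputs and action tokens) by taking the adapter output as given; variant names and signatures are illustrative assumptions.

```python
import torch


def expert_weights(variant, router_logits, adapter_out):
    """Sketch of the three gating variants in the ablation.
    router_logits, adapter_out: (tokens, n_experts). The adapter input
    pipeline is omitted; only the weight combination rule is shown."""
    probs = router_logits.softmax(-1)
    if variant == "vanilla":   # coupled: softmax router alone selects AND weights
        return probs
    if variant == "csmoe":     # adapter directly outputs the expert weights
        return adapter_out.softmax(-1)
    if variant == "adamoe":    # additive: router weight + scale-adapter weight
        return probs + adapter_out
    raise ValueError(f"unknown variant: {variant}")
```

In the additive design, the router can keep satisfying the load-balancing objective while the scale adapter freely adjusts how much each selected expert contributes.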

As shown in Table 3, our AdaMoE-VLA achieves the best overall performance across the LIBERO task suites, with an average improvement of 1.6% over vanilla MoE with load balancing. The concatenated approach shows moderate improvements, validating the importance of decoupling, while our additive design proves most effective.


Real-World Experiments

To validate the practical effectiveness of our AdaMoE-VLA approach, we conduct real-world robotic manipulation experiments using a dual-arm manipulation platform. Our experimental setup utilizes the ALOHA-Agilex system developed by AgileX Robotics, equipped with two Piper robotic arms that enable bimanual manipulation capabilities.

We design four representative manipulation tasks that cover diverse manipulation skills and evaluate our method's performance in real-world scenarios:

1) Place Cup: Precise positioning
2) Stack Plate: Stable stacking
3) Click Bell: Coordinated activation
4) Adjust Bottle: Fine orientation

Table 5 presents the success rates of our AdaMoE-VLA compared to the π0 baseline across all four real-world manipulation tasks. Our method demonstrates consistent improvements across all tasks, with particularly notable gains in complex manipulation scenarios requiring precise coordination.

Below are videos of our AdaMoE-VLA model demonstrating various robust behaviors across the four representative manipulation tasks. (Videos are sped up by 2×.)

Place Cup
Stack Plate
Click Bell
Adjust Bottle


BibTeX

@misc{shen2025expertiseneedmonopolizeactionspecialized,
  title={Expertise Need Not Monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning},
  author={Weijie Shen and Yitian Liu and Yuhao Wu and Zhixuan Liang and Sijia Gu and Dehui Wang and Tian Nian and Lei Xu and Yusen Qin and Jiangmiao Pang and Xinping Guan and Xiaokang Yang and Yao Mu},
  year={2025},
  eprint={2510.14300},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.14300},
}